|Home | About | Journals | Submit | Contact Us | Français|
The human Y-chromosome does not recombine across its male-specific part and is therefore an excellent marker of human migrations. It also plays an important role in male fertility. However, its evolution is difficult to fully understand because of repetitive sequences, inverted repeats and the potentially large role of gene conversion. Here we perform an evolutionary analysis of 62 Y-chromosomes of Danish descent sequenced using a wide range of library insert sizes and high coverage, thus allowing large regions of these chromosomes to be well assembled. These include 17 father-son pairs, which we use to validate variation calling. Using a recent method that can integrate variants based on both mapping and de novo assembly, we genotype 10898 SNVs and 2903 indels (max length of 27241 bp) in our sample and show by father-son concordance and experimental validation that the non-recurrent SNP and indel variation on the Y chromosome tree is called very accurately. This includes variation called in a 0.9 Mb centromeric heterochromatic region, which is by far the most variable in the Y chromosome. Among the variation is also longer sequence-stretches not present in the reference genome but shared with the chimpanzee Y chromosome. We analyzed 2.7 Mb of large inverted repeats (palindromes) for variation patterns among the two palindrome arms and identified 603 mutation and 416 gene conversions events. We find clear evidence for GC-biased gene conversion in the palindromes (and a balancing AT mutation bias), but irrespective of this, also a strong bias towards gene conversion towards the ancestral state, suggesting that palindromic gene conversion may alleviate Muller’s ratchet. Finally, we also find a large number of large-scale gene duplications and deletions in the palindromic regions (at least 24) and find that such events can consist of complex combinations of simultaneous insertions and deletions of long stretches of the Y chromosome.
The Y chromosome is extraordinary in many respects; it is non-recombining along most of its length, it carries many testis-expressed genes that are often found in palindromes and thus in several copies, and it is generally highly repetitive with very few unique genes. Its evolutionary process is not well understood in general because short-read mapping in such complex sequence is difficult. We combine de novo assembly and mapping to investigate evolution in more than 60% of the length of 62 Y chromosomes of Danish descent. We find that Y chromosome evolution is very dynamic even among the set of closely related Y chromosomes in Denmark with many cases of complex duplications and deletions of large regions including whole genes, clear evidence of GC-biased gene conversion in the palindromes and a tendency for gene conversion to revert mutations to their ancestral state.
The human Y chromosome shares around 3 Mb of sequence with the X chromosome—the telomeric pseudoautosomal regions, where recombination occurs. The remainder of the Y chromosome, the male-specific region (MSY) is inherited from father to son without recombination and its evolution therefore reflects mutation, drift, and selection in males and follows a single phylogenetic tree. The MSY consists of interspersed regions of different origins. 1) The X-degenerate regions are directly descended from the chromosome pair that became the sex chromosomes ~180 Mya (million years ago) and have retained 16 genes homologous to X genes, 12 of which are single copy , which are thought to be needed for dosage reasons [2,3]. 2) The X-transposed region is unique to humans and originated by a duplication event from the X chromosome (Xq21) approximately 3–4 million years ago . It is now ~99% identical to the homologous region on the X chromosome. 3) The ampliconic regions are all highly repetitive and contain palindromes (inverted repeats) as large as 1.5 Mb with >99.9% similarity between arms. The genes in these regions are thus present in multiple copies and are predominantly expressed in testes . They are thought to affect fertility [4–6], and they potentially also engage in an arms race with genes in similar ampliconic regions on the X chromosome for transmission to the sperm cells during male spermatogenesis [7–9]. Another fascinating feature of the palindromes is that they exchange genetic material between the two arms via non-allelic homologous recombination (NAHR). One effect is that it reduces the divergence between the arms [10,11] and offers a potential way of repairing deleterious mutations, thus escaping Muller’s ratchet. Another effect of NAHR is allowing palindrome arms to exchange material across different palindromes leading to large structural and copy number variants .
While these features make the MSY interesting, they also make it difficult to study.
Mapping of short reads to the reference genome (Next generation sequencing or NGS) has been used to survey single nucleotide variants (SNVs), small indels and copy number variants on the Y chromosome [13–15]. However, NGS has limited power when calling variants in highly similar regions such as the ampliconic regions or in the heterochromatic regions, indels that exceed the read length, large chromosomal rearrangements and variants with sequence not present in the reference.
To investigate variation in highly similar regions and potentially find novel sequence on the Y chromosome, de novo assemblies show promising results. High-quality assemblies have been constructed for the chimpanzee  and the human  Y chromosomes, and recently using chromosome sorting and hybrid assemblies also for the gorilla Y chromosome . However, these approaches do not scale well to many individuals from the same species.
The Danish Pan Genome Project  provided us with short-read paired-end sequences from libraries with insert sizes ranging from 180 bp to 20 kb for a total of 40X coverage for the Y chromosome. Here we use these data to construct hybrid assemblies for 62 males, including 17 father-son pairs, where we can use concordance of variant calls as a quality measure. From hybrid assemblies and traditional read-mapping we here report a detailed analysis of SNV and structural variation, including standing variation in copy number variants, evolutionary dynamics of the palindromes and estimation of mutation rates.
We constructed whole-genome hybrid assemblies for the 68 males reported in . We extracted the scaffolds that mapped to the MSY and used them to call variants with respect to the reference genome (GRCh38). We excluded scaffolds that mapped ambiguously to the X and the Y chromosome in the X-transposed region. Six individuals (all fathers) produced poor assemblies, and we excluded these from further analysis (S1 Fig). We therefore based our analysis on the remaining 62 Y chromosomes (27 fathers, one son and 17 father-son pairs).
We were able to recover large scaffolds that on average cover 86.7% of the X-degenerate regions and 41.8% of the ampliconic regions. The reason for the lower coverage of the ampliconic regions is that they mainly consist of palindromes with too high inter-arm similarity to be assembled. In the hybrid assemblies, the palindrome arms are collapsed into one arm with twice the coverage of the scaffolds mapping to the X-degenerate part of the Y chromosome. We show below how they can be partly de-collapsed.
The scaffolds have an N50 with a mean of 1.29 Mb and median of 1.42 Mb among the individual assemblies. The contigs have a mean and median N50 of 40 Kb. Sequence gaps (patches with Ns) constitute less than 4% and are typically found in blocks (min = 1 bp, max = 20146 bp, mean = 1335.2, bp median = 384.5 bp). Fig 1 shows a dot plot of the scaffolds of one individual to the reference genome. The repetitive nature and collapsed palindromes can be seen as departures from the diagonal.
We evaluated the quality of the scaffolds using alignment between father and son in the seventeen father-son pairs. Differences between fathers and sons can be caused by assembly errors, alignment errors, or de novo mutations. To assess the quality of the scaffolds, we excluded repeat masked regions (we excluded 6.9 Mb out of 11 Mb aligned base pairs on average) because they contain low complexity repeats and are therefore prone to alignment errors. For the remaining parts of the scaffolds, any differences between father and son scaffold should mainly be due to assembly errors. The concordance rates between fathers and sons, separated into regions, are also shown in Fig 1. The concordance rates were highest in the X-degenerate regions and lowest in the ampliconic regions but typically, we found between 0.1 and 10 differences in 10,000 bp.
We compiled a candidate set of indels and SNVs combining two distinct approaches. First, we called SNVs and indels using traditional mapping approaches (BWA-MEM and GATK haplotype-caller module, henceforth referred to as GATK-HC)[19,20]. Second, we called indels directly from the hybrid assemblies using AsmVar , which aligns scaffolds to a reference sequence using LAST  and then finds differences. We then genotyped the merged set of candidate variants using BayesTyper , which assigns genotype probabilities for each variant, in each individual, based on the k-mer footprint of the variants and the k-mer distribution in the raw reads. We labelled variants already in dbSNP142, 1000 Y  or 1000 genomes phase 3  as known and show both the number of variants and the proportion of these that are already known in Table 1. Variants are denoted complex if they cannot be changed into the reference by a single deletion or insertion event. We provide a full list of variants in Supplementary file S2 Dataset.
We used SNVs in the X-degenerate region (3126 in total) to construct a phylogenetic tree for the Y chromosomes of the 62 individuals using neighbor joining (NJ) and we grouped the individuals based on which mutations they shared. SNV-defined haplotypes (haplogroups) for each family are shown in S5 Dataset. (see S2 Fig; almost identical results were found using maximum likelihood (ML), apart from one individual of haplotype I1a1b1, which did not group with individuals of I1a1b in ML). We divided variants into those occurring once in the phylogeny (non-recurrent) and those occurring multiple times in the phylogeny (recurrent). We report variants for the heterochromatic region separately because the SNV density is much higher here.
A summary of the number of variants is shown in Table 1. We find a very high father-son concordance for non-recurrent variants whereas only about 2/3 of the recurrent variants are concordant among father-son pairs suggesting a rather high false discovery rate for these. There are two primary reasons for this low concordance rate in the recurrent variants. 1) They often occur at positions with multiple similar variants, for instance the reference is C and the alternative alleles are CA,CAAA,CAAAA,CAA. 2) The flanking sequence up and downstream of the variant is very similar to other parts of the genome. Both will not give unique k-mers and therefore the genotyper has difficulties determining the genotype.
We experimentally validated a set of random large indels by Sanger sequencing. We chose to validate these because they had a lower concordance rate than the SNVs. Within the set of non-recurrent variants, 43 out of 43 validated experimentally (29 deletions and 14 insertions) yielding a validation rate of 100%. For recurrent variants, 16 were validated out of 20 yielding a validation rate of 80%. Both are in line with the in silico validation rates. Details on the validation results are presented in supplementary file S2 Dataset.
Fig 2 breaks down the variant set into size classes, call set origin and shows the difference between Y chromosomes along the Y chromosome. The proportion of the variants that have not been observed before (novel) increases with the size of the indels. While the majority of known indels below 15 bp in length are identified by both methods, mapping-based assemblies identifies more short novel variants. The opposite is true for variants larger than 15 bp, especially for insertions where most were identified from the hybrid assemblies.
Even though the Y chromosomes studied here belongs to common European haplogroups (R and I) and haplogroup Q we identify 29 novel variants present in all haplogroup I individuals, 66 novel variants present in all Q individuals and 1 novel variant present in all Rs. We also found 174 Insertions and 104 deletions in all individuals, meaning that this sequence has likely been lost or gained in the reference Y chromosome. We used BLAST  to investigate if insertions above 500 bp have similarities to other known sequences. One variant found in all haplogroup I individuals is a 3326 bp insertion that shares 98% identity to a segment on the chimpanzee Y chromosome and thus must have been lost in the lineage leading to the reference Y chromosome that is R1b1.
Because a single phylogeny can be constructed for Y chromosomes, we can estimate how many generations are spanned by the tree using the number of X-degenerate mutations and a X-degenerate specific mutation rate of 3.14E-8, which was estimated based on resequencing of the Y chromosomes of 753 genealogically-connected Icelandic males spanning a total of 47,123 years . We use the number of variants and the length of the callable region to give estimates of the rate at which different variants occur in Table 2. It is apparent that the ampliconic substitution rate is smaller than the rate for the X-degenerate region (discussed further below) and that the estimated rate for the heterochromatic region is much higher than for the rest of the Y chromosome. We note that other calibrations could also be used, e.g. using the rate from  of 2.15E-8 (assuming 29 years per generation) would reduce all our estimates by 33% but keep the same relative differences among types of variation and among genomic regions. We report the rate in mutation per position per generation (PPPG).
Despite the very high estimated substitution rate for the centromeric heterochromatic region, we observe the same high in silico validation rate suggesting that this region is indeed extraordinarily polymorphic and not subject to a higher false positive rate.
To investigate possible reasons for the extraordinary polymorphism, we searched for homology of sequence in this region with other parts of the genome. Using BLAT in sliding non-overlapping windows of 50 kb of the centromeric heterochromatic region (10 Mb to 11.7 Mb) we found that 684 kb of the 1.7 Mb had windows of 10 kb with more than 96% similarity to sequence fragments from other chromosomes (see supplementary file S2 Dataset for details). These chromosomes include regions close to the centromere on chromosomes 21, 9, 22, 2 and 16, plus uncharacterized fragments such as Un_GL000218v1. The full list is given in S2 Dataset.
The two arms of the palindromes are collapsed in the hybrid assemblies due to their high sequence similarity. However, we can investigate palindrome dynamics with respect to mutations and gene conversions using a similar approach to that of another study . The approach is to map reads to one palindrome arm so that differences between arms will appear as pseudo-heterozygous SNV. We mapped reads to all proximal arms of the 8 palindromes and focused on regions with twice the coverage as the X-degenerate region. We therefore removed r1, r2, r3, r4, g1, g2, g3, b1, b2, b3, and b4, because these are present more than twice, retaining 2.7 Mb of sequence (see Fig 3 for naming of segments).
To find mutations and gene conversions, we inferred the ancestral state of each node in the phylogeny based on X-degenerate SNVs. We inferred the ancestral state of the haplogroups present in this study (R,I and Q) using individuals ERR1395549, ERR1347702 and ERR1395593 from the Simons Genome Diversity Project , which belong to haplogroups H2b, C2a and E1b, respectively as outgroups. S4 Fig shows the evolutionary relationship between the haplogroups in this study and the haplogroups used as outgroups. We use the estimated split times between haplogroups from .
To identify mutation events, we searched for positions where the pseudo-genotype of the parent node was homozygous and the child node was heterozygous. This would mean that a mutation occurred on the branch between the parent and child node, changing one of the bases in one of the palindrome arms making the position look like a pseudo-heterozygous allele. In Fig 3B on the far right, this would correspond to position 2 in individual 1.
To identify gene conversion events, we searched for positions where the pseudo-genotype of the parent node was heterozygous and the child node was homozygous. In Fig 3B, on the far right, this would correspond to positions three and four in the second individual.
We successfully identify 603 mutations and 416 gene conversions in 696 positions across the 7 palindromes. This is a major increase compared to previous studies of 10 positions  and 3 positions  and allow us to quantify rates and types of mutations and gene conversions. We find that there is a bias towards converting bases into their ancestral state (Fig 3C). We find 100 gene conversions that converted the ancestral to the derived and 171 that converted the derived to the ancestral which is statistically significant (p = 1.61e-05 Chi-square test). In the remaining cases the ancestral genotype was inferred to be heterozygous. We find that gene conversions also show a bias towards GC base pairs with 259 GC conversions and 158 AT conversions which is statistically significant (p = 7.58e-07 Chi-square test). We also have the opportunity to study new mutations within the recent history of the palindromes. We find a mutation bias towards AT base pairs with 336 mutations towards AT bases and 267 mutations towards GC bases (56% versus 44%) which is statistically significant (p = 4.96e-03 Chi-square test). This ratio is similar to what we find when we look at all mutations along the Y chromosome (54% versus 46%). To investigate whether this ratio is affected by gene conversion, we also looked at more recent mutations, private to one individual or one family. We find 186 mutations to AT base pairs and 157 mutations to GC base pairs, which is similar to the rate of all mutations (54% versus 46%), but it is not statistically significant (p = 0.11 Chi-square test) due to the low sample size.
Multiple adjacent gene conversions in an individual can either be explained by many independent gene conversions or a single large gene conversion. The latter would be the most parsimonious. In individual 623–01 we find that almost all pseudoheterozygous positions have been converted into pseudohomozygous positions (see Fig 3E). This suggest that y1 or y2 of palindrome 1 has been deleted and the drop in coverage from ~60X to ~30X supports this finding. In the 1113 family, we find a 150 kb segment where all pseudoheterozygous positions have been converted to pseudohomozygous positions (see Fig 3E). We also find that the coverage has increased in this region from ~60X to ~90X. This suggest that part of one arm has replaced the other arm in a large gene conversion event and then been copied yet again, yielding three identical copies of the segment. This emphasizes that very complex rearrangements of the Y chromosome occur.
To estimate the rate of gene conversions and mutations we used the total length of the palindrome sequence that was analyzed (2.7 Mb) and the generations spanned by the tree found using the X-degenerate SNPs. We find that the gene conversion rate is 1.21E-8 events per position per genome and the mutation rate is 1.76E-8 events per position per generation. The gene conversion rate fits well another study  where they estimate it to be between 7.25E-9 events per position per generation and 2.10E-8 events per position per generation. The mutation rate is, however, lower than what was found in another study, which was 2.86E-8 events per position per generation .
To identify copy number variation, we mapped all sequence reads to one copy of each of the 24 distinct genes on the MSY except the X-transposed region. We normalized coverage to 1 Mb of X-degenerate region sequence and then estimated the copy number from the median read coverage along each gene compared to the 1 Mb region. For copy numbers in fathers and sons to be considered concordant we required normalized coverage to differ less than 0.5. We find that the concordance between the independently estimated copy numbers for father and son is 97.8% (399 matches of 408 pairs). All cases of non-concordance were found in TSPY and VCY with 5 and 4 dis-concordant father-son pairs respectively. There was never more than one copy number difference in these genes. Fig 4 summarizes the results based on the raw data found in supplementary file S4 Dataset.
The RBMY1A1 gene changes copy number 15 times in the phylogeny. If we assume that there were 9 copies in the ancestor, we observe 8 independent deletions and 7 independent duplications.
Due to our inability to estimate TSPY copy numbers more precisely than to within one copy, we only count changes by more than 1 copy number from the rounded mean in the haplogroup, which was 22 for R1b, R1a and I1a, 23 for Q1a and 21 for I2a. Using this conservative approach, we find that TSPY changes copy number 5 times in our phylogeny. Many of the gene duplication and deletion events are probably linked due to their close proximity in the palindrome arms. It could also be possible that multiple independent gene duplications and gene losses have occurred, but given that there are already examples of duplication and deletions that include entire palindrome arms , it is more parsimonious that the events are linked.
We find an event that could be the known gr1/gr2 deletion where palindrome 2 and part of palindrome 1 is deleted which leads to a loss of copy of BPY2 and CDY, three exons in PRY and two copies of DAZ in individual 623–01 . We also found that this individual had half the coverage in the y1/y2 part of palindrome 1 compared to individuals in the same haplogroups, and very few pseudodiploid SNVs in the y1/y2 part of palindrome 1. All of this points towards a deletion of r1, r2, b3 and y1 for this individual.
We find higher copy numbers for all exons of PRY in individual 890–01. Since only exons 3, 4 and 5 are present in b3 and b4, the most likely explanation is a b1 or b2 duplication. We find a gr1/gr2 duplication event in 995–01 with one extra copy of CDY, BPY2, higher copy number for exons 3,4 and 5 in PRY and two copies of DAZ. Lastly, we find a novel duplication of part of palindrome 5 arm in a family 1113 leading to extra copies of CDY and XRKY. Only part of palindrome 5 is lacking pseudodiploid positions in this family, and this suggests that only a part of the arm has been duplicated. Most of these events are probably due to NAHR events, due to their presence in palindrome arms.
We have shown that it is possible to assemble large parts of the human Y chromosome from ~40X short reads from multiple insert size libraries. Regions with high similarity to other chromosomes (X-transposed) do not have Y chromosome specific scaffolds, showing that the hybrid assembly approach cannot reliably distinguish regions that are more similar than 99%. This is also the reason why the palindrome arms are collapsed in the scaffolds.
The problem with collapsed palindromes was also a problem in the recent Gorilla Y chromosomes assembly . To obtain fully assembled amplicons, methods like SHIMS  or very long reads from third generation sequencing methods are needed.
The availability of hybrid assemblies allows for indels larger than the read length to be identified, including large segments that have been deleted in the reference sequence. The 3326 bp insertion we find in haplogroup I, which shares 98% identity to the chimpanzee, has likely been in all Y chromosome haplogroups until recently, when it was deleted in haplogroups R and Q; thus it is missing from the reference genome and has therefore been missed by 1000Y or similar mapping-based initiatives. The heterochromatic regions are not usually included in analysis of the Y, but our high father-son concordance rate in this region suggest that it is should be included. Our results imply either that the point mutation rate in this region is up to an order of magnitude higher than the rest of the Y chromosome or that genetic variation has been introduced by non-homologous gene conversions from other chromosomes. The high similarity to other chromosomes in this region might suggest a rich history of transposition and a previous study found interchromosomal duplication events within 5 MB from the centromere on the Y chromosome and within 5 Mb of the centromere on chromosome 9 and 16 among others . Since all of the variation is not shared between all individuals we would require multiple interchromosomal gene duplications to explain our data. We conclude that the observed 10-fold higher rate of polymorphism merits further studies to establish the mechanism.
We have, for the first time, inferred mutation and gene conversion events across all unique palindromes, which allowed us to investigate the gene conversions and mutation process in detail. We present evidence both of GC-biased gene conversion and conversion towards the ancestral state as suggested in previous much smaller studies [10,11]. The mutation process is biased towards AT, as in the rest of the genome, leading to a dynamic equilibrium where gene conversion is more likely to repair towards the ancestral state. We also find that in addition to this, gene conversion seems to favor the ancestral state even for mutations that do not change the GC content (27 gene conversions towards ancestral vs 17 gene conversions towards derived). Taken together these two effects slow down the evolution of palindrome sequence, explaining why we infer a lower mutation rate than in the X degenerate region (see also Helgason et al. 2015).
Previous studies have shown that lower TSPY copy number and gr/gr deletions can increase the risk of spermatogenic failure [5,31]. However, all individuals in this study have no known diseases and all fathers produced sons or daughters, so the change in copy numbers does not seem to have a severe effect.
Our approach for identifying copy number of genes has limitations. The approach cannot distinguish between functional genes and pseudogenes as in the case with RBMY1A, which has 6 functional genes present in the reference sequence  and an unknown number of pseudogenes. This means that the copy number changes we observe could just as well be pseudogenes, which might not have an impact on an individual. To solve this problem, one would need mRNA expression data as well.
Moreover this method is not good at differentiating between copy numbers when they are very large, as in the case with TSPY where almost all our individuals have fewer copies than found in a previous study which used PmeI pulsed-field DNA blotting (between 26–28 copies in the haplogroups R,I and Q) . This method also works best with uniform coverage across the gene but some genes contain repeats as in DAZ .
We constructed de novo assemblies with ALLPATHS-LG  using 7 different library insert sizes of 180, 500, 800, 2000, 5000, 10000 and 20000 bp see . The scaffolds mapping to the Y chromosome when using LAST  were found.
In order for a scaffold to be kept it was required that the majority of the fragments were aligned to the Y chromosome and aligned with more than 1000 bp. This will remove fragments that align to the pseudo-autosomal region and X-transposed region and small scaffolds that align ambiguously across the genome. For closing the gaps within the scaffolds a program from SOAPdenovo2 called GapCloser  was used using reads from each of the 7 libraries. The first cycle started with the 180 bp insert library, the second with the 500 bp insert size and so on. The repeats in the scaffolds were masked using repeatmasker .
The scaffolds were sorted based on which order they mapped to the reference and reverse complemented if they mapped to the reverse strand.
For each father son pair, the Y chromosome was broken down into regions (1st ampliconic region. 2nd ampliconic region. 1st X-degenerate region. 2nd X-degenerate region and so on) and scaffolds mapping to each region were extracted for each individual. The scaffolds were aligned using MAFFT  and regions identified by repeatmasker as repetitive were masked. Mismatches within 50 bp of alignment gaps were also masked. The concordance rate was reported for windows of 10 kb.
Reads were mapped to the reference (GRCh38) using BWA-MEM version 0.7.5a  and refined by Stampy version 1.0.23  both with standard parameters. GATK Haplotype caller version 3.2–2 was used for finding variants . AsmVar was used for finding indels . This was run using the gapclosed scaffolds.
The variants called with GATK and AsmVar was merged based on position, reference and alternative allele and genotyped using BayesTyper . BayesTyper is a probabilistic framework for genotyping variants, that constructs a variant graph using all the input variants and the reference sequence. A library of k-mers of size 55 were constructed for all the reads coming from an individual. BayesTyper then checks how well a path through the variant graph is supported by the k-mers for an individual. The genotypes for all individuals are estimated jointly and only variants with a posterior probability greater than 0.9 with more than 3 k-mers supporting it were kept.
We randomly selected 27 non-recurrent variants and 23 non-recurrent variants. For each variant, we picked one individual with the reference allele and one with the alternative. If the correct allele was present and we could sequence 50 bp up and downstream of the variant, we called the variant as validated.
The SNPs called using GATK were used for constructing the neighbor joining (NJ) tree. The SNPs were required to have a filter status of PASS, not be recurrent and they need to be in the X-degenerate region. The NJ tree was constructed using MEGA 6  using the number of substitutions as the model and pairwise deletion as missing data treatment. It was run with 500 bootstrap replicates.
Haplogroups were called with respect to a minimal list of SNPs  and the ISOGG database (International Society of Genetic Genealogy), Y-chromosome phylogeny, Y-DNA Haplogroup Tree 2016, Version: 11.239, Date: 2 September 2016, http://www.isogg.org/tree/.
If the haplogroup could not be identified the individual was assigned the haplogroup of the other individuals that it clusters with in the Neighbor joining tree.
The reads of each individual were mapped to all palindrome proximal arms using BWA-mem and filtered and sorted with Sambamba version 0.5.1 .
Reads were filtered away with Sambamba using the following criteria: "not (duplicate or secondary_alignment or unmapped) and mapping_quality > = 50 and cigar = ~ /100M/ and [NM] < 2". The reads cannot be duplicate, in secondary alignments or unmapped. Furthermore, the quality of the reads must be above 50, no insertions must be in the read and the number of mismatches in a read must be below 2. In addition the mate-paired reads above 10000 bp were not filtered with the ‘proper pair ‘ option.
SNPs were then called using platypus  with standard parameters other than “—maxReads 800000000—maxVariants 20”.
The coverage of each position was calculated using Samtools version 0.1.19 .
The SNPs were then filtered by requiring that each position had a depth of coverage between 50 and 250. Finally, the SNPs were phased using GATK with the parameters:
“—phaseQualityThresh 20.0—fix_misencoded_quality_scores -fixMisencodedQuals”.
The number of events per position per generation (PPPG) for insertions, deletions, complex variants, STR mutations in the palindrome and gene conversions in the palindrome were calculated by dividing the number of events by the generations spanned by the tree and the length of the segment analyzed.
The total number of generations was calculated from the SNV mutation rate of 3.14 10−8 PPPG  for 3126 SNVs called in 8.1 Mb of the X-degenerate region. The number of generations is estimated to be:
For mutations in palindromes there were 606 events and the length of the palindrome arms called was 2*1.4 Mb and the number of generations spanned by the tree was 12265 This means that the number of event PPPG was .
In order to call gene conversion and mutation events in all palindromes we devised the following algorithm. First a phylogeny of all the samples must exist. We start with all the observed genotypes for all individuals in the tree. Then we do a bottom-up filling out of ancestral nodes based on their children. Next, we do a top-down assignment of events if the parent is different for one of the children. The method is illustrated in S3 Fig.
The coding sequence +/- 2 kb up- and downstream of 26 protein-coding genes on the Y chromosome was found using http://www.ncbi.nlm.nih.gov/ along with 1 Mb of X-degenerate region. The raw reads were mapped to these genes for each individual with BWA-mem and filtered using Sambamba with the parameters mentioned above.
CNVnator  was used to call duplications and deletions using a binsize of 100 bp and GC correction. CNVnator was used in the whole gene except for DAZ (only exon 28 was used–chrY: 23198797–23199094), PRY (exons 1 and 2, chrY: 22490396–22490484 and chrY: 2490585–22490672) and TSPY (FAM197Y2P –chrY: 9479053–9484654) was used.
This study has been approved of Den Nationale Videnskabsetiske komité (The Danish national committee on health research ethics) with approval number 1210920. All individuals participating have provided informed written consent.
Individual sequence data, alignment based assemblies for all 62 individuals in bam file format are available at the European Genome-phenome Archive (EGA), which is hosted by the EBI, under accession number EGAS00001002108. The variants used in this project has accession number EGAD00001003186. All other data, except the de novo assemblies, are within the paper and its Supporting Information files.
Due to patient confidentiality, access to the individual de novo assemblies can be accessed through agreement with the either Prof. Søren Brunak: email@example.com, Prof. Mikkel Schierup: kd.ua.crib@ediehm or Prof. Karsten Kristiansen: kd.uk.oib@kk.
This spreadsheet contains 3 tabs; Scaffold statistics, Regions of the Y chromosome that lists the different reigons of the Y chromosome and Alignment to reference.
This spreadsheet contains 6 tabs; Variants that contains all called variants, Variants summary, Heterochrom BLAT that shows the similarity to other chromosomes of segments in the centromeric heterochromatic region, Validation results that show the Sanger validation for structural variants, Mutation rates, BLAST for large insertions.
This spreadsheet contains two tabs; Length of palindrome arms used and Events.
This spreadsheet contains two tabs; Copy number–not rounded and genes used.
This spreadsheet contains one tab; Haplogroups.
Neighbor joining vs. Maximum likelihood.
We thank Pille Hallast, Andrea Massaia and Chris Tyler-Smith for their helpful comments on our manuscript and Jonas Andreas Sibbesen and Lasse Maretty for their help with the BayesTyper analysis.
Danish Pan Genome Consortium:
Lasse Maretty1, Jacob Malte Jensen2,3, Bent Petersen4, Jonas Andreas Sibbesen1, Siyang Liu1,5, Palle Villesen2,3,6, Laurits Skov2,3, Kirstine Belling4, Christian Theil Have7, Jose Izarzugaza4, Marie Grosjean4, Jette Bork-Jensen7, Jakob Grove3,8, Thomas D. Als3,8, Shujia Huang5,9, Yuqi Chang10, Ruiqi Xu5, Weijian Ye5, Junhua Rao5, Xiaosen Guo10, Jihua Sun5,7, Hongzhi Cao10, Chen Ye10, Johan v Beusekom4, Thomas Espeseth11, Esben Flindt4, Rune M Friborg2,3, Anders E. Halager2,3, Stephanie Le Hellard11,12, Christina M Hultman13, Francesco Lescai3,8, Shengting Li3,8, Ole Lund4, Peter Løngren4, Thomas Mailund2,3, Maria Luisa Matey-Hernandez4, Ole Mors3,6, Christian NS Pedersen2,3, Thomas Sicheritz-Pontén4, Patrick Sullivan13,14, Ali Syed4, David Westergaard4, Rachita Yadav4, Ning Li5, Xun Xu10, Torben Hansen7, Anders Krogh1, Lars Bolund8,10, Thorkild IA Sørensen7,15,16, Oluf Pedersen7, Ramneek Gupta4, Simon Rasmussen4, Søren Besenbacher2,6, Anders D. Børglum3,8, Jun Wang3,10, Hans Eiberg17, Karsten Kristiansen10,18, Søren Brunak4,19, Mikkel Heide Schierup2,3,20
1 Bioinformatics Centre, Department of Biology, University of Copenhagen, 2200 Copenhagen N, Denmark
2 Bioinformatics Research Centre, Aarhus University, 8000 Aarhus C, Denmark
3 iSEQ, Centre for Integrative Sequencing, Aarhus University, 8000 Aarhus C, Denmark
4 Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, Kemitorvet, 2800 Kongens Lyngby, Denmark
5 BGI-Europe, Ole Maaløes Vej 3, 2200 Copenhagen N, Denmark
6 Department of Clinical Medicine, Aarhus University, 8000 Aarhus C, Denmark
7 Novo Nordisk Foundation Center for Basic Metabolic Research, Section of Metabolic Genetics, University of Copenhagen, 2100 Copenhagen Ø, Denmark
8 Department of Biomedicine, Aarhus University, 8000 Aarhus C, Denmark
9 School of Bioscience and Biotechnology, South China University of Technology, Guangzhou 510006, China
10 BGI-Shenzhen, Shenzhen 518083, China
11 NORMENT, KG Jebsen Centre for Psychosis Research, Department of Clinical Science, University of Bergen, Bergen, 5021, Norway
12 Dr E. Martens Research Group of Biological Psychiatry, Center for Medical Genetics and Molecular Medicine, Haukeland University Hospital, Bergen, 5021, Norway
13 Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, 17177, Sweden
14 Department of Genetics, University of North Carolina, Chapel Hill, NC 27599–7264, USA
15 Institute of Preventive Medicine, Bispebjerg and Frederiksberg Hospital, The Capital Region, Copenhagen, Denmark
16 Department of Public Health, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
17 MRC Integrative Epidemiology Unit, School of Social and Community Medicine, Bristol University, Bristol, UK
18 Department of Cellular and Molecular Medicine, University of Copenhagen, 2200 Copenhagen N, Denmark
19 Laboratory of Genomics and Molecular Biomedicine, Department of Biology, University of Copenhagen, 2100 Copenhagen Ø, Denmark
20 Department of Bioscience, Aarhus University, 8000 Aarhus C, Denmark
This work was supported by the Danish Council for Independent Research grant number: DFF 1323-00076 URL: http://ufm.dk/en/research-and-innovation/councils-and-commissions/the-danish-council-for-independent-research/the-council/the-danish-council-for-independent-research-natural-sciences?set_language=en&cl=en. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Individual sequence data and alignment based assemblies for all 62 individuals in bam file format are available at the European Genome-phenome Archive (EGA), which is hosted by the EBI, under accession number EGAS00001002108. The variants used in this project are available under accession number EGAD00001003186. Due to patient confidentiality, access to the individual de novo assemblies can be accessed through agreement with the corresponding author of Maretty et al (2017) doi:10.1038/nature23264. All other data are within the paper and its Supporting Information files.