In forensic casework, Y chromosome short tandem repeat markers (Y-STRs) are often used to identify a male donor DNA profile in the presence of excess quantities of female DNA, such as is found in many sexual assault investigations. Commercially available Y-STR multiplexes incorporating 12–17 loci are currently used in forensic casework (Promega's PowerPlex® Y and Applied Biosystems' AmpFlSTR® Yfiler®). Despite the robustness of these commercial multiplex Y-STR systems and the ability to discriminate two male individuals in most cases, the coincidence match probabilities between unrelated males are modest compared with the standard set of autosomal STR markers. Hence there is still a need to develop new multiplex systems to supplement these for those cases where additional discriminatory power is desired or where there is a coincidental Y-STR match between potential male participants. Over 400 Y-STR loci have been identified on the Y chromosome. While these have the potential to increase the discrimination potential afforded by the commercially available kits, many have not been well characterized. In the present work, 91 loci were tested for their relative ability to increase the discrimination potential of the commonly used ‘core’ Y-STR loci. The result of this extensive evaluation was the development of an ultra high discrimination (UHD) multiplex DNA typing system that allows for the robust co-amplification of 14 non-core Y-STR loci. Population studies with a mixed African American and American Caucasian sample set (n = 572) indicated that the overall discriminatory potential of the UHD multiplex was superior to all commercial kits tested. The combined use of the UHD multiplex and the Applied Biosystems' AmpFlSTR® Yfiler® kit resulted in 100% discrimination of all individuals within the sample set, which presages its potential to maximally augment currently available forensic casework markers. It could also find applications in human evolutionary genetics and genetic genealogy.
We analysed 67 short tandem repeat polymorphisms from the non-recombining part of the Y-chromosome (Y-STRs), including 49 rarely-studied simple single-copy (ss)Y-STRs and 18 widely-used Y-STRs, in 590 males from 51 populations belonging to 8 worldwide regions (HGDP-CEPH panel). Although autosomal DNA profiling provided no evidence for close relationship, we found 18 Y-STR haplotypes (defined by 67 Y-STRs) that were shared by two to five men in 13 worldwide populations, revealing high and widespread levels of cryptic male relatedness. Maximal (95.9%) haplotype resolution was achieved with the best 25 out of 67 Y-STRs in the global dataset, and with the best 3-16 markers in regional datasets (89.6-100% resolution). From the 49 rarely-studied ssY-STRs, the 25 most informative markers were sufficient to reach the highest possible male lineage differentiation in the global (92.2% resolution), and 3-15 markers in the regional datasets (85.4-100%). Considerably lower haplotype resolutions were obtained with the three commonly-used Y-STR sets (Minimal Haplotype, PowerPlex Y®, and AmpFlSTR® Yfiler®). Six ssY-STRs (DYS481, DYS533, DYS549, DYS570, DYS576 and DYS643) were most informative to supplement the existing Y-STR kits for increasing haplotype resolution, or – together with additional ssY-STRs - as a new set for maximizing male lineage differentiation. Mutation rates of the 49 ssY-STRs were estimated from 403 meiotic transfers in deep-rooted pedigrees, and ranged from ~4.8×10−4 for 31 ssY-STRs with no mutations observed to 1.3×10−2 and 1.5×10−2 for DYS570 and DYS576, respectively, the latter representing the highest mutation rates reported for human Y-STRs so far. Our findings thus demonstrate that ssY-STRs are useful for maximizing global and regional resolution of male lineages, either as a new set, or when added to commonly-used Y-STR sets, and support their application to forensic, genealogical and anthropological studies.
Y-STR; microsatellites; Y-chromosome; haplotype resolution; lineage differentiation; HGDP-CEPH, mutation rates
Koreans are generally considered a Northeast Asian group, thought to be related to Altaic-language-speaking populations. However, recent findings have indicated that the peopling of Korea might have been more complex, involving dual origins from both southern and northern parts of East Asia. To understand the male lineage history of Korea, more data from informative genetic markers from Korea and its surrounding regions are necessary. In this study, 25 Y-chromosome single nucleotide polymorphism markers and 17 Y-chromosome short tandem repeat (Y-STR) loci were genotyped in 1,108 males from several populations in East Asia.
In general, we found East Asian populations to be characterized by male haplogroup homogeneity, showing major Y-chromosomal expansions of haplogroup O-M175 lineages. Interestingly, a high frequency (31.4%) of haplogroup O2b-SRY465 (and its sublineage) is characteristic of male Koreans, whereas the haplogroup distribution elsewhere in East Asian populations is patchy. The ages of the haplogroup O2b-SRY465 lineages (~9,900 years) and the pattern of variation within the lineages suggested an ancient origin in a nearby part of northeastern Asia, followed by an expansion in the vicinity of the Korean Peninsula. In addition, the coalescence time (~4,400 years) for the age of haplogroup O2b1-47z, and its Y-STR diversity, suggest that this lineage probably originated in Korea. Further studies with sufficiently large sample sizes to cover the vast East Asian region and using genomewide genotyping should provide further insights.
These findings are consistent with linguistic, archaeological and historical evidence, which suggest that the direct ancestors of Koreans were proto-Koreans who inhabited the northeastern region of China and the Korean Peninsula during the Neolithic (8,000-1,000 BC) and Bronze (1,500-400 BC) Ages.
The objective of this study was to investigate the quantitative characteristics of short tandem repeat (STR) variations deduced on the basis of the number of STRs that are beneficial for human survival. The longevity group included 60 nonagenarian subjects, and the control group included 250 reference adults (age, 20–50 years). Alleles of 15 Combined DNA Index System STR loci were determined using a commercial polymerase chain reaction kit. An STR with the highest frequency distribution in a population (control group) was considered as a conservative STR, and the number of core unit repeats of this STR allele was considered as the median repeat number in the STR locus (STRm). The absolute difference between the STRm and the number of core unit repeats of other STR alleles can be considered as the quantitative marker of variation for that particular STR allele (M value). The mean M values of CSF1TPO in the longevity group were significantly higher than those in the control group (P < 0.05). These findings appear to suggest that at least one of the STR loci may be associated with longevity. The M value of STR may be a new and high-efficacy genetic marker.
Longevity; STR; Genetic marker; Variation
The Y-chromosomal short tandem repeat (Y-STR) polymorphisms included in the AmpFlSTR® Yfiler® polymerase chain reaction amplification kit have become widely used for forensic and evolutionary applications where a reliable knowledge on mutation properties is necessary for correct data interpretation. Therefore, we investigated the 17 Yfiler Y-STRs in 1,730–1,764 DNA-confirmed father–son pairs per locus and found 84 sequence-confirmed mutations among the 29,792 meiotic transfers covered. Of the 84 mutations, 83 (98.8%) were single-repeat changes and one (1.2%) was a double-repeat change (ratio, 1:0.01), as well as 43 (51.2%) were repeat gains and 41 (48.8%) repeat losses (ratio, 1:0.95). Medians from Bayesian estimation of locus-specific mutation rates ranged from 0.0003 for DYS448 to 0.0074 for DYS458, with a median rate across all 17 Y-STRs of 0.0025. The mean age (at the time of son’s birth) of fathers with mutations was with 34.40 (±11.63) years higher than that of fathers without ones at 30.32 (±10.22) years, a difference that is highly statistically significant (p < 0.001). A Poisson-based modeling revealed that the Y-STR mutation rate increased with increasing father’s age on a statistically significant level (α = 0.0294, 2.5% quantile = 0.0001). From combining our data with those previously published, considering all together 135,212 meiotic events and 331 mutations, we conclude for the Yfiler Y-STRs that (1) none had a mutation rate of >1%, 12 had mutation rates of >0.1% and four of <0.1%, (2) single-repeat changes were strongly favored over multiple-repeat ones for all loci but 1 and (3) considerable variation existed among loci in the ratio of repeat gains versus losses. Our finding of three Y-STR mutations in one father–son pair (and two pairs with two mutations each) has consequences for determining the threshold of allelic differences to conclude exclusion constellations in future applications of Y-STRs in paternity testing and pedigree analyses.
Electronic supplementary material
The online version of this article (doi:10.1007/s00414-009-0342-y) contains supplementary material, which is available to authorized users.
Y-STR; Mutation; Microsatellites; Y-chromosome; AmpFlSTR YFiler kit
Although simple tandem repeats (STRs) comprise ~2% of the human genome and represent an important source of polymorphism, this class of variation remains understudied. We have developed a cost-effective strategy for performing targeted enrichment of STR regions that utilizes capture probes targeting the flanking sequences of STR loci, enabling specific capture of DNA fragments containing STRs for subsequent high-throughput sequencing. Utilizing a capture design targeting 6,243 STR loci <94bp and multiplexing eight individuals in a single Illumina HiSeq2000 sequencing lane we were able to call genotypes in at least one individual for 67.5% of the targeted STRs. We observed a strong relationship between (G+C) content and genotyping rate. STRs with moderate (G+C) content were recovered with >90% success rate, while only 12% of STRs with ≥80% (G+C) were genotyped in our assay. Analysis of a parent-offspring trio, complete hydatidiform mole samples, repeat analyses of the same individual, and Sanger sequencing-based validation indicated genotyping error rates between 7.6–12.4%. The majority of such errors were a single repeat unit at mono- or dinucleotide repeats. Altogether, our STR capture assay represents a cost-effective method that enables multiplexed genotyping of thousands of STR loci suitable for large scale population studies.
microsatellite; repeat variation; genome instability; high-throughput sequencing; sequence capture
To determine the human Y-chromosome haplogroup backgrounds of intermediate-sized variant alleles displayed by short tandem repeat (STR) loci DYS392, DYS449, and DYS385, and to evaluate the potential of each intermediate variant to elucidate new phylogenetic substructure within the human Y-chromosome haplogroup tree.
Molecular characterization of lineages was achieved using a combination of Y-chromosome haplogroup defining binary polymorphisms and up to 37 short tandem repeat loci. DNA sequencing and median-joining network analyses were used to evaluate Y-chromosome lineages displaying intermediate variant alleles.
We show that DYS392.2 occurs on a single haplogroup background, specifically I1*-M253, and likely represents a new phylogenetic subdivision in this European haplogroup. Intermediate variants DYS449.2 and DYS385.2 both occur on multiple haplogroup backgrounds, and when evaluated within specific haplogroup contexts, delineate new phylogenetic substructure, with DYS449.2 being informative within haplogroup A-P97 and DYS385.2 in haplogroups D-M145, E1b1a-M2, and R1b*-M343. Sequence analysis of variant alleles observed within the various haplogroup backgrounds showed that the nature of the intermediate variant differed, confirming the mutations arose independently.
Y-chromosome short tandem repeat intermediate variant alleles, while relatively rare, typically occur on multiple haplogroup backgrounds. This distribution indicates that such mutations arise at a rate generally intermediate to those of binary markers and Y-STR loci. As a result, intermediate-sized Y-STR variants can reveal phylogenetic substructure within the Y-chromosome phylogeny not currently detected by either binary or Y-STR markers alone, but only when such variants are evaluated within a haplogroup context.
Interruptions of microsatellite sequences impact genome evolution and can alter disease manifestation. However, human polymorphism levels at interrupted microsatellites (iMSs) are not known at a genome-wide scale, and the pathways for gaining interruptions are poorly understood. Using the 1000 Genomes Phase-1 variant call set, we interrogated mono-, di-, tri-, and tetranucleotide repeats up to 10 units in length. We detected ∼26,000–40,000 iMSs within each of four human population groups (African, European, East Asian, and American). We identified population-specific iMSs within exonic regions, and discovered that known disease-associated iMSs contain alleles present at differing frequencies among the populations. By analyzing longer microsatellites in primate genomes, we demonstrate that single interruptions result in a genome-wide average two- to six-fold reduction in microsatellite mutability, as compared with perfect microsatellites. Centrally located interruptions lowered mutability dramatically, by two to three orders of magnitude. Using a biochemical approach, we tested directly whether the mutability of a specific iMS is lower because of decreased DNA polymerase strand slippage errors. Modeling the adenomatous polyposis coli tumor suppressor gene sequence, we observed that a single base substitution interruption reduced strand slippage error rates five- to 50-fold, relative to a perfect repeat, during synthesis by DNA polymerases α, β, or η. Computationally, we demonstrate that iMSs arise primarily by base substitution mutations within individual human genomes. Our biochemical survey of human DNA polymerase α, β, δ, κ, and η error rates within certain microsatellites suggests that interruptions are created most frequently by low fidelity polymerases. Our combined computational and biochemical results demonstrate that iMSs are abundant in human genomes and are sources of population-specific genetic variation that may affect genome stability. The genome-wide identification of iMSs in human populations presented here has important implications for current models describing the impact of microsatellite polymorphisms on gene expression.
Microsatellites are short tandem repeat DNA sequences located throughout the human genome that display a high degree of inter-individual variation. This characteristic makes microsatellites an attractive tool for population genetics and forensics research. Some microsatellites affect gene expression, and mutations within such microsatellites can cause disease. Interruption mutations disrupt the perfect repeated array and are frequently associated with altered disease risk, but they have not been thoroughly studied in human genomes. We identified interrupted mono-, di-, tri- and tetranucleotide MSs (iMS) within individual genomes from African, European, Asian and American population groups. We show that many iMSs, including some within disease-associated genes, are unique to a single population group. By measuring the conservation of microsatellites between human and chimpanzee genomes, we demonstrate that interruptions decrease the probability of microsatellite mutations throughout the genome. We demonstrate that iMSs arise in the human genome by single base changes within the DNA, and provide biochemical data suggesting that these stabilizing changes may be created by error-prone DNA polymerases. Our genome-wide study supports the model in which iMSs act to stabilize individual genomes, and suggests that population-specific differences in microsatellite architecture may be an avenue by which genetic ancestry impacts individual disease risk.
Short tandem repeats (STRs) are abundant in human genomes. Numerous STRs have been shown to be associated with genetic diseases and gene regulatory functions, and have been selected as genetic markers for evolutionary and forensic analyses. High-throughput next generation sequencers have fostered new cutting-edge computing techniques for genome-scale analyses, and cross-genome comparisons have facilitated the efficient identification of polymorphic STR markers for various applications.
An automated and efficient system for detecting human polymorphic STRs at the genome scale is proposed in this study. Assembled contigs from next generation sequencing data were aligned and calibrated according to selected reference sequences. To verify identified polymorphic STRs, human genomes from the 1000 Genomes Project were employed for comprehensive analyses, and STR markers from the Combined DNA Index System (CODIS) and disease-related STR motifs were also applied as cases for evaluation. In addition, we analyzed STR variations for highly conserved homologous genes and human-unique genes. In total 477 polymorphic STRs were identified from 492 human-unique genes, among which 26 STRs were retrieved and clustered into three different groups for efficient comparison.
We have developed an online system that efficiently identifies polymorphic STRs and provides novel distinguishable STR biomarkers for different levels of specificity. Candidate polymorphic STRs within a personal genome could be easily retrieved and compared to the constructed STR profile through query keywords, gene names, or assembled contigs.
Short tandem repeat; 1000 Genomes Project; CODIS; Next generation sequencing; genetic disease; orthologous gene; human-unique gene
To develop novel DNA extraction and typing procedure for DNA identification of the 7th century human remains, determine the familiar relationship between the individuals, estimate the Y-chromosome haplogroup, and compare the Y-chromosome haplotype with the contemporary populations.
DNA from preserved femur samples was extracted using the modified silica-based extraction technique. Polymerase chain reaction amplification was performed using human identification kits MiniFiler, Identifiler, and Y-filer and also laboratory-developed and validated Y-chromosome short tandem repeat (STR) pentaplexes with short amplicons.
For 244A, 244B, 244C samples, full autosomal DNA profiles (15 STR markers and Amelogenin) and for 244D, 244E, 244F samples, MiniFiler profiles were produced. Y-chromosome haplotypes consisting of up to 24 STR markers were determined and used to predict the Y-chromosome haplogroups and compare the resulting haplotypes with the current population. Samples 244A, 244B, 244C, and 244D belong to Y-chromosome haplogroup R1b and the samples 244E and 244F to haplogroup G2a. Comparison of ancient haplotypes with the current population yielded numerous close matches with genetic distance bellow 2.
Application of forensic genetics in archaeology enables retrieving new types of information and helps in data interpretation. The number of successfully typed autosomal and Y-STR loci from ancient specimens in this study is one of the largest published so far for aged samples.
Recently, the Combined DNA Index System (CODIS) Core Loci Working Group established by the US Federal Bureau of Investigation (FBI) reviewed and recommended changes to the CODIS core loci. The Working Group identified 20 short tandem repeat (STR) loci (composed of the original CODIS core set loci (minus TPOX), four European recommended loci, PentaE, and DYS391) plus the Amelogenin marker as the new core set. Before selecting and finalizing the core loci, some evaluations are needed to provide guidance for the best options of core selection.
The performance of current and newly proposed CODIS core loci sets were evaluated with simplified analyses for adventitious hit rates in reasonably large datasets under single-source profile comparisons, mixture comparisons and kinship searches, and for international data sharing. Informativeness (for example, match probability, average kinship index (AKI)) and mutation rates of each locus were some of the criteria to consider for loci selection. However, the primary factor was performance with challenged forensic samples.
The current battery of loci provided in already validated commercial kits meet the needs for single-source profile comparisons and international data sharing, even with relatively large databases. However, the 13 CODIS core loci are not sufficiently powerful for kinship analyses and searching potential contributors of mixtures in larger databases; 19 or more autosomal STR loci perform better. Y-chromosome STR (Y-STR) loci are very useful to trace paternal lineage, deconvolve female and male mixtures, and resolve inconsistencies with Amelogenin typing. The DYS391 locus is of little theoretical or practical use. Combining five or six Y-chromosome STR loci with existing autosomal STR loci can produce better performance than the same number of autosomal loci for kinship analysis and still yield a sufficiently low match probability for single-source profile comparisons.
A more comprehensive study should be performed to provide the necessary information to decision makers and stakeholders about the construction of a new set of core loci for CODIS. Finally, selection of loci should be driven by the concept that the needs of casework should be supported by the processes of CODIS (or for that matter any forensic DNA database).
Here we describe a new panel of short tandem repeats (STRs) for a novel exact typing assay that can be used to discriminate between Aspergillus fumigatus isolates. A total of nine STR markers were selected from available genomic A. fumigatus sequences and were divided into three multicolor multiplex PCRs. Each multiplex reaction amplified three di-, tri-, or tetranucleotide repeats, respectively. All nine STR markers were used to analyze 100 presumably unrelated A. fumigatus isolates. For each marker, between 11 and 37 alleles were found in this population. One isolate proved to be a mixture of at least two different isolates. With the remaining 99 isolates, 96 different fingerprinting profiles were obtained. The Simpson's diversity index for the individual markers ranged from 0.77 to 0.97. The diversity index for the multiplex combination of di-, tri-, and tetranucleotide repeats ranged from 0.9784 to 0.9968. The combination of all nine markers yielded a Simpson's diversity index of 0.9994, indicative of the high discriminatory power of these new loci. In theory, this panel of markers is able to discriminate between no less than 27 × 109 different genotypes. The multicolor multiplex approach allows large numbers of markers to be tested in a short period of time. The exact nature of the assay combines high reproducibility with the easy exchange of results and makes it a very suitable tool for large-scale epidemiological studies.
High rates of esophageal cancer (EC) are found in people of the Henan Taihang Mountain, Fujian Minnan, and Chaoshan regions of China. Historical records describe great waves of populations migrating from north-central China (the Henan and Shanxi Hans) through coastal Fujian Province to the Chaoshan plain. Although these regions are geographically distant, we hypothesized that EC high-risk populations in these three areas could share a common ancestry. Accordingly, we used 16 East Asian-specific Y-chromosome biallelic markers (single nucleotide polymorphisms; Y-SNPs) and six Y-chromosome short tandem repeat (Y-STR) loci to infer the origin of the EC high-risk Chaoshan population (CSP) and the genetic relationship between the CSP and the EC high-risk Henan Taihang Mountain population (HTMP) and Fujian population (FJP). The predominant haplogroups in these three populations are O3*, O3e*, and O3e1, with no significant difference between the populations in the frequency of these genotypes. Frequency distribution and principal component analysis revealed that the CSP is closely related to the HTMP and FJP, even though the former is geographically nearer to other populations (Guangfu and Hakka clans). The FJP is between the CSP and HTMP in the principal component plot. The CSP, FJP and HTMP are more closely related to Chinese Hans than to minorities, except Manchu Chinese, and are descendants of Sino-Tibetans, not Baiyues. Correlation analysis, hierarchical clustering analysis, and phylogenetic analysis (neighbor-joining tree) all support close genetic relatedness among the CSP, FJP and HTMP. The network for haplogroup O3 (including O3*, O3e* and O3e1) showed that the HTMP have highest STR haplotype diversity, suggesting that the HTMP may be a progenitor population for the CSP and FJP. These findings support the potentially important role of shared ancestry in understanding more about the genetic susceptibility in EC etiology in high-risk populations and have implications for determining the molecular basis of this disease.
Tandem repeats (TRs) are unstable regions commonly found within genomes that have consequences for evolution and disease. In humans, polymorphic TRs are known to cause neurodegenerative and neuromuscular disorders as well as being associated with complex diseases such as diabetes and cancer. If present in upstream regulatory regions, TRs can modify chromatin structure and affect transcription; resulting in altered gene expression and protein abundance. The most common TRs are short tandem repeats (STRs), or microsatellites. Promoter located STRs are considerably more polymorphic than coding region STRs. As such, they may be a common driver of phenotypic variation. To study STRs located in regulatory regions, we have performed genome-wide analysis to identify all STRs present in a region that is 2 kilobases upstream and 1 kilobase downstream of the transcription start sites of genes.
The Short Tandem Repeats in Regulatory Regions Table, STaRRRT, contains the results of the genome-wide analysis, outlining the characteristics of 5,264 STRs present in the upstream regulatory region of 4,441 human genes. Gene set enrichment analysis has revealed significant enrichment for STRs in cellular, transcriptional and neurological system gene promoters and genes important in ion and calcium homeostasis. The set of enriched terms has broad similarity to that seen in coding regions, suggesting that regulatory region STRs are subject to similar evolutionary pressures as STRs in coding regions and may, like coding region STRs, have an important role in controlling gene expression.
STaRRRT is a readily-searchable resource for investigating potentially polymorphic STRs that could influence the expression of any gene of interest. The processes and genes enriched for regulatory region STRs provide potential novel targets for diagnosing and treating disease, and support a role for these STRs in the evolution of the human genome.
Short tandem repeats; STR; Microsatellites; Simple sequence repeats; SSR; Promoter; Regulatory region; Neurological disease; Neural genes; Evolution
Y-chromosome microsatellites (Y-STRs) are typically used for kinship analysis and forensic identification, as well as for inferences on population history and evolution. All applications would greatly benefit from reliable locus-specific mutation rates, to improve forensic probability calculations and interpretations of diversity data. However, estimates of mutation rate from father–son transmissions are available for few loci and have large confidence intervals, because of the small number of meiosis usually observed. By contrast, population data exist for many more Y-STRs, holding unused information about their mutation rates. To incorporate single locus diversity information into Y-STR mutation rate estimation, we performed a meta-analysis using pedigree data for 80 loci and individual haplotypes for 110 loci, from 29 and 93 published studies, respectively. By means of logistic regression we found that relative genetic diversity, motif size and repeat structure explain the variance of observed rates of mutations from meiosis. This model allowed us to predict locus-specific mutation rates (mean predicted mutation rate 2.12 × 10−3, SD=1.58 × 10−3), including estimates for 30 loci lacking meiosis observations and 41 with a previous estimate of zero. These estimates are more accurate than meiosis-based estimates when a small number of meiosis is available. We argue that our methodological approach, by taking into account locus diversity, could be also adapted to estimate population or lineage-specific mutation rates. Such adjusted estimates would represent valuable information for selecting the most reliable markers for a wide range of applications.
mutation rate; Y-chromosome microsatellites; meiosis; population genetics; glm
The results of our bioinformatics analysis have found over 91,000 di-, tri-, and tetranucleotide microsatellites in our survey of 25% of the X. tropicalis genome, suggesting there may be over 360,000 within the entire genome. Within the X. tropicalis genome, dinucleotide (78.7%) microsatellites vastly out numbered tri- and tetranucleotide microsatellites. Similarly, AT-rich repeats are overwhelmingly dominant. The four AT-only motifs (AT, AAT, AAAT, and AATT) account for 51,858 out of 91,304 microsatellites found. Individually, AT microsatellites were the most common repeat found, representing over half of all di-, tri-, and tetranucleotide microsatellites. This contrasts with data from other studies, which show that AC is the most frequent microsatellite in vertebrate genomes (Toth et al. 2000). In addition, we have determined the rate of polymorphism for 5,128 non-redundant microsatellites, embedded in unique sequences. Interestingly, this subgroup of microsatellites was determined to have significantly longer repeats than genomic microsatellites as a whole. In addition, microsatellite loci with tandem repeat lengths more than 30 bp exhibited a significantly higher degree of polymorphism than other loci. Pairwise comparisons show that tetranucleotide microsatellites have the highest polymorphic rates. In addition, AAT and ATC showed significant higher polymorphism than other trinucleotide microsatellites, while AGAT and AAAG were significantly more polymorphic than other tetranucleotide microsatellites.
microsatellite; polymorphism; Xenopus genome
The dynamics of microsatellite, or short tandem repeats (STRs), is well documented for long, polymorphic loci, but much less is known for shorter ones. For example, the issue of a minimum threshold length for DNA slippage remains contentious. Model-fitting methods have generally concluded that slippage only occurs over a threshold length of about eight nucleotides, in contradiction with some direct observations of tandem duplications at shorter repeated sites. Using a comparative analysis of the human and chimpanzee genomes, we examined the mutation patterns at microsatellite loci with lengths as short as one period plus one nucleotide. We found that the rates of tandem insertions and deletions at microsatellite loci strongly deviated from background rates in other parts of the human genome and followed an exponential increase with STR size. More importantly, we detected no lower threshold length for slippage. The rate of tandem duplications at unrepeated sites was higher than expected from random insertions, providing evidence for genome-wide action of indel slippage (an alternative mechanism generating tandem repeats). The rate of point mutations adjacent to STRs did not differ from that estimated elsewhere in the genome, except around dinucleotide loci. Our results suggest that the emergence of STR depends on DNA slippage, indel slippage, and point mutations. We also found that the dynamics of tandem insertions and deletions differed in both rates and size at which these mutations take place. We discuss these results in both evolutionary and mechanistic terms.
tandem repeats; comparative genomics; microsatellite emergence; DNA slippage; indel slippage; point mutations; human
Sakha – an area connecting South and Northeast Siberia – is significant for understanding the history of peopling of Northeast Eurasia and the Americas. Previous studies have shown a genetic contiguity between Siberia and East Asia and the key role of South Siberia in the colonization of Siberia.
We report the results of a high-resolution phylogenetic analysis of 701 mtDNAs and 318 Y chromosomes from five native populations of Sakha (Yakuts, Evenks, Evens, Yukaghirs and Dolgans) and of the analysis of more than 500,000 autosomal SNPs of 758 individuals from 55 populations, including 40 previously unpublished samples from Siberia. Phylogenetically terminal clades of East Asian mtDNA haplogroups C and D and Y-chromosome haplogroups N1c, N1b and C3, constituting the core of the gene pool of the native populations from Sakha, connect Sakha and South Siberia. Analysis of autosomal SNP data confirms the genetic continuity between Sakha and South Siberia. Maternal lineages D5a2a2, C4a1c, C4a2, C5b1b and the Yakut-specific STR sub-clade of Y-chromosome haplogroup N1c can be linked to a migration of Yakut ancestors, while the paternal lineage C3c was most likely carried to Sakha by the expansion of the Tungusic people. MtDNA haplogroups Z1a1b and Z1a3, present in Yukaghirs, Evens and Dolgans, show traces of different and probably more ancient migration(s). Analysis of both haploid loci and autosomal SNP data revealed only minor genetic components shared between Sakha and the extreme Northeast Siberia. Although the major part of West Eurasian maternal and paternal lineages in Sakha could originate from recent admixture with East Europeans, mtDNA haplogroups H8, H20a and HV1a1a, as well as Y-chromosome haplogroup J, more probably reflect an ancient gene flow from West Eurasia through Central Asia and South Siberia.
Our high-resolution phylogenetic dissection of mtDNA and Y-chromosome haplogroups as well as analysis of autosomal SNP data suggests that Sakha was colonized by repeated expansions from South Siberia with minor gene flow from the Lower Amur/Southern Okhotsk region and/or Kamchatka. The minor West Eurasian component in Sakha attests to both recent and ongoing admixture with East Europeans and an ancient gene flow from West Eurasia.
mtDNA; Y chromosome; Autosomal SNPs; Sakha
In recent years it has been demonstrated that structural variations, such as indels (insertions and deletions), are common throughout the genome, but the implications of structural variations are still not clearly understood. Long tandem repeats (e.g. microsatellites or simple repeats) are known to be hypermutable (indel-rich), but are rare in exons and only occasionally associated with diseases. Here we focus on short (imperfect) tandem repeats (STRs) which fall below the radar of conventional tandem repeat detection, and investigate whether STRs are targets for disease-related mutations in human exons. In particular, we test whether they share the hypermutability of the longer tandem repeats and whether disease-related genes have a higher STR content than non-disease-related genes.
We show that validated human indels are extremely common in STR regions compared to non-STR regions. In contrast to longer tandem repeats, our definition of STRs found them to be present in exons of most known human genes (92%), 99% of all STR sequences in exons are shorter than 33 base pairs and 62% of all STR sequences are imperfect repeats. We also demonstrate that STRs are significantly overrepresented in disease-related genes in both human and mouse. These results are preserved when we limit the analysis to STRs outside known longer tandem repeats.
Based on our findings we conclude that STRs represent hypermutable regions in the human genome that are linked to human disease. In addition, STRs constitute an obvious target when screening for rare mutations, because of the relatively low amount of STRs in exons (1,973,844 bp) and the limited length of STR regions.
To report on the use of STR, Y-STRs, and miniSTRs typing methods in the identification of victims of revolutionary violence and crimes against humanity committed by the Communist Armed Forces during and after World War II in which bodies were exhumed from mass and individual graves in Slovenia.
Bone fragments and teeth were removed from human remains found in several small and closely located hidden mass graves in the Škofja Loka area (Lovrenska Grapa and Žolšče) and 2 individual graves in the Ljubljana area (Podlipoglav), Slovenia. DNA was isolated using the Qiagen DNA extraction procedure optimized for bone and teeth. Some DNA extracts required additional purification, such as N-buthanol treatment. The QuantifilerTM Human DNA Quantification Kit was used for DNA quantification. Initially, PowerPlex 16 kit was used to simultaneously analyze 15 short tandem repeat (STR) loci. The PowerPlex S5 miniSTR kit and AmpFℓSTR® MiniFiler PCR Amplification Kit was used for additional analysis if preliminary analysis yielded weak partial or no profiles at all. In 2 cases, when the PowerPlex 16 profiles indicated possible relatedness of the remains with reference samples, but there were insufficient probabilities to call the match to possible male paternal relatives, we resorted to an additional analysis of Y-STR markers. PowerPlex® Y System was used to simultaneously amplify 12 Y-STR loci. Fragment analysis was performed on an ABI PRISM 310 genetic analyzer. Matching probabilities were estimated using the DNA-View software.
Following the Y-STR analysis, 1 of the “weak matches” previously obtained based on autosomal loci, was confirmed while the other 1 was not. Combined standard STR and miniSTR approach applied to bone samples from 2 individual graves resulted in positive identifications. Finally, using the same approach on 11 bone samples from hidden mass grave Žološče, we were able to obtain 6 useful DNA profiles.
The results of this study, in combination with previously obtained results, demonstrate that Y-chromosome testing and miniSTR methodology can contribute to the identification of human remains of victims of revolutionary violence from World War II.
Technological advances have lead to the rapid increase in availability of single nucleotide polymorphisms (SNPs) in a range of organisms, and there is a general optimism that SNPs will become the marker of choice for a range of evolutionary applications. Here, comparisons between 300 polymorphic SNPs and 14 short tandem repeats (STRs) were conducted on a data set consisting of approximately 500 Atlantic salmon arranged in 10 samples/populations.
Global FST ranged from 0.033-0.115 and -0.002-0.316 for the 14 STR and 300 SNP loci respectively. Global FST was similar among 28 linkage groups when averaging data from mapped SNPs. With the exception of selecting a panel of SNPs taking the locus displaying the highest global FST for each of the 28 linkage groups, which inflated estimation of genetic differentiation among the samples, inferred genetic relationships were highly similar between SNP and STR data sets and variants thereof. The best 15 SNPs (30 alleles) gave a similar level of self-assignment to the best 4 STR loci (83 alleles), however, addition of further STR loci did not lead to a notable increase assignment whereas addition of up to 100 SNP loci increased assignment.
Whilst the optimal combinations of SNPs identified in this study are linked to the samples from which they were selected, this study demonstrates that identification of highly informative SNP loci from larger panels will provide researchers with a powerful approach to delineate genetic relationships at the individual and population levels.
Numerous studies of human populations in Europe and Asia have revealed a concordance between their extant genetic structure and the prevailing regional pattern of geography and language. For native South Americans, however, such evidence has been lacking so far. Therefore, we examined the relationship between Y-chromosomal genotype on the one hand, and male geographic origin and linguistic affiliation on the other, in the largest study of South American natives to date in terms of sampled individuals and populations. A total of 1,011 individuals, representing 50 tribal populations from 81 settlements, were genotyped for up to 17 short tandem repeat (STR) markers and 16 single nucleotide polymorphisms (Y-SNPs), the latter resolving phylogenetic lineages Q and C. Virtually no structure became apparent for the extant Y-chromosomal genetic variation of South American males that could sensibly be related to their inter-tribal geographic and linguistic relationships. This continent-wide decoupling is consistent with a rapid peopling of the continent followed by long periods of isolation in small groups. Furthermore, for the first time, we identified a distinct geographical cluster of Y-SNP lineages C-M217 (C3*) in South America. Such haplotypes are virtually absent from North and Central America, but occur at high frequency in Asia. Together with the locally confined Y-STR autocorrelation observed in our study as a whole, the available data therefore suggest a late introduction of C3* into South America no more than 6,000 years ago, perhaps via coastal or trans-Pacific routes. Extensive simulations revealed that the observed lack of haplogroup C3* among extant North and Central American natives is only compatible with low levels of migration between the ancestor populations of C3* carriers and non-carriers. In summary, our data highlight the fact that a pronounced correlation between genetic and geographic/cultural structure can only be expected under very specific conditions, most of which are likely not to have been met by the ancestors of native South Americans.
In the largest population genetic study of South Americans to date, we analyzed the Y-chromosomal makeup of more than 1,000 male natives. We found that the male-specific genetic variation of Native Americans lacks any clear structure that could sensibly be related to their geographic and/or linguistic relationships. This finding is consistent with a rapid initial peopling of South America, followed by long periods of isolation in small tribal groups. The observed continent-wide decoupling of geography, spoken language, and genetics contrasts strikingly with previous reports of such correlation from many parts of Europe and Asia. Moreover, we identified a cluster of Native American founding lineages of Y chromosomes, called C-M217 (C3*), within a restricted area of Ecuador in North-Western South America. The same haplogroup occurs at high frequency in Central, East, and North East Asia, but is virtually absent from North (except Alaska) and Central America. Possible scenarios for the introduction of C-M217 (C3*) into Ecuador may thus include a coastal or trans-Pacific route, an idea also supported by occasional archeological evidence and the recent coalescence of the C3* haplotypes, estimated from our data to have occurred some 6,000 years ago.
To perform a genetic characterization of 7 skeletons from medieval age found in a burial site in the Aragonese Pyrenees.
Allele frequencies of autosomal short tandem repeats (STR) loci were determined by 3 different STR systems. Mitochondrial DNA (mtDNA) and Y-chromosome haplogroups were determined by sequencing of the hypervariable segment 1 of mtDNA and typing of phylogenetic Y chromosome single nucleotide polymorphisms (Y-SNP) markers, respectively. Possible familial relationships were also investigated.
Complete or partial STR profiles were obtained in 3 of the 7 samples. Mitochondrial DNA haplogroup was determined in 6 samples, with 5 of them corresponding to the haplogroup H and 1 to the haplogroup U5a. Y-chromosome haplogroup was determined in 2 samples, corresponding to the haplogroup R. In one of them, the sub-branch R1b1b2 was determined. mtDNA sequences indicated that some of the individuals could be maternally related, while STR profiles indicated no direct family relationships.
Despite the antiquity of the samples and great difficulty that genetic analyses entail, the combined use of autosomal STR markers, Y-chromosome informative SNPs, and mtDNA sequences allowed us to genotype a group of skeletons from the medieval age.
Genetic variation on the non-recombining portion of the Y chromosome contains information about the ancestry of male lineages. Because of their low rate of mutation, single nucleotide polymorphisms (SNPs) are the markers of choice for unambiguously classifying Y chromosomes into related sets of lineages known as haplogroups, which tend to show geographic structure in many parts of the world. However, performing the large number of SNP genotyping tests needed to properly infer haplogroup status is expensive and time consuming. A novel alternative for assigning a sampled Y chromosome to a haplogroup is presented here. We show that by applying modern machine-learning algorithms we can infer with high accuracy the proper Y chromosome haplogroup of a sample by scoring a relatively small number of Y-linked short tandem repeats (STRs). Learning is based on a diverse ground-truth data set comprising pairs of SNP test results (haplogroup) and corresponding STR scores. We apply several independent machine-learning methods in tandem to learn formal classification functions. The result is an integrated high-throughput analysis system that automatically classifies large numbers of samples into haplogroups in a cost-effective and accurate manner.
The Y chromosome is passed on from father to son as a nearly identical copy. Occasionally, small random changes occur in the Y DNA sequences that are passed forward to the next generation. There are two kinds of changes that may occur, and they both provide vital information for the study of human ancestry. Of the two kinds, one is a single letter change, and the other is a change in the number of short tandemly repeating sequences. The single-letter changes can be laborious to test, but they provide information on deep ancestry. Measuring the number of sequence repeats at multiple places in the genome simultaneously is efficient, and provides information about recent history at a modest cost. We present the novel approach of training a collection of modern machine-learning algorithms with these sequence repeats to infer the single-letter changes, thus assigning the samples to deep ancestry lineages.
A tandem repeat’s (TR) propensity to mutate increases with repeat number, and can become very pronounced beyond a critical boundary, transforming it into a microsatellite (MS). However, a clear understanding of the mutational behavior of different TR classes and motifs and related mechanisms is lacking, as is a consensus on the existence of a boundary separating short TRs (STRs) from MSs. This hinders our understanding of MSs’ mutational properties and their effective use as genetic markers. Using indel calls for 179 individuals from 1000 Genomes Pilot-1 Project, we determined polymorphism incidence for four major TR classes, and formalized its varying relationship with repeat number using segmented regression. We observed a biphasic regime with a transition from a faster to a slower exponential growth at 9, 5, 4, and 4 repeats for mono-, di-, tri-, and tetranucleotide TRs, respectively. We used an in vitro mutagenesis assay to evaluate the contribution of strand slippage errors to mutability. STRs and MSs differ in their absolute polymorphism levels, but more importantly in their rates of mutability growth. Although strand slippage is a major factor driving mononucleotide polymorphism incidence, dinucleotide polymorphism incidence is greater than that expected due to strand slippage alone, indicating that additional cellular factors might be driving dinucleotide mutability in the human genome. Leveraging on hundreds of human genomes, we present the first comprehensive, genome-wide analysis of TR mutational behavior, encompassing several motif sizes and compositions.
tandem repeats; short tandem repeats; microsatellites; replication slippage; segmented regression; change point