Background: In a recent paper by Homer et al. (Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genet 2008;4:e1000167), a method for detecting whether a given individual is a contributor to a particular genomic mixture was proposed. This prompted grave concern about the public dissemination of aggregate statistics from genome-wide association studies. It is of clear scientific importance that such data be shared widely, but the confidentiality of study participants must not be compromised. The issue of what summary genomic data can safely be posted on the web is only addressed satisfactorily when the theoretical underpinnings of the proposed method are clarified and its performance evaluated in terms of dependence on underlying assumptions.
Methods: The original method raised a number of concerns and several alternatives have since been proposed, including a simple linear regression approach. In our proposed generalized estimating equation approach, we maintain the simplicity of the linear regression model but obtain inferences that are more robust to approximation of the variance/covariance structure and can accommodate linkage disequilibrium.
Results: We affirm that, in principle, it is possible to determine that a ‘candidate’ individual has participated in a study, given a subset of aggregate statistics from that study. However, the methods depend critically on a number of key factors including: the ancestry of participants in the study; the absolute and relative numbers of cases and controls; and the number of single nucleotide polymorphisms.
Conclusions: Simple guidelines for publication that are based on a single criterion are therefore unlikely to suffice. In particular, ‘directed’ summary statistics should not be posted openly on the web but could be protected by an internet-based access check as proposed by the P3G Consortium et al. (Public access to genome-wide data: five views on balancing research with privacy and protection. PLoS Genet 2009;5:e1000665).
Identification; linear regression; generalized estimating equations; linkage disequilibrium; case–control genetic association studies
Homer and others (2008. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genetics 4, e1000167) recently showed that, given allele frequency data for a large number of single nucleotide polymorphisms in a sample together with corresponding population “reference” frequencies, by typing an individual's DNA sample at the same set of loci it can be inferred whether or not the individual was a member of the sample. This observation has been responsible for precautionary removal of large amounts of summary data from public access. This and further work on the problem have followed a frequentist approach. This paper sets out a Bayesian analysis of this problem which clarifies the role of the reference frequencies and allows incorporation of prior probabilities of the individual's membership in the sample.
Bayesian analysis; Data confidentiality; Statistical genetics
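The Bayesian framing in the abstract above rests on the standard identity that posterior odds equal prior odds multiplied by the likelihood ratio. A minimal sketch of that update follows, assuming the genotype evidence has already been summarized as a log likelihood ratio for membership versus non-membership; the function name and inputs are illustrative and are not taken from the paper.

```python
import math


def posterior_membership(prior, log_lr):
    """Posterior probability of sample membership.

    prior  : prior probability that the individual is in the sample (0 < prior < 1)
    log_lr : natural-log likelihood ratio, membership vs. non-membership
             (assumed to be computed elsewhere from the genotype data)
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    # Posterior log odds = prior log odds + log likelihood ratio.
    log_odds = math.log(prior / (1.0 - prior)) + log_lr
    # Convert log odds to a probability without overflowing exp().
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)
```

With a prior of 0.01, a likelihood ratio of 99 brings the posterior probability to 0.5, which shows how strongly the conclusion depends on the prior.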
Samples containing highly unbalanced DNA mixtures from two individuals commonly occur both in forensic mixed stains and in peripheral blood DNA microchimerism induced by pregnancy or following organ transplant. Because of PCR amplification bias, the genetic identification of a DNA that contributes trace amounts to a mixed sample represents a tremendous challenge. This means that standard genetic markers, namely microsatellites, also referred to as short tandem repeats (STR), and single-nucleotide polymorphisms (SNPs), have limited power in addressing common questions of forensic and medical genetics. To address this issue, we developed a molecular marker, named DIP–STR, that relies on pairing deletion–insertion polymorphisms (DIP) with STR. This novel analytical approach allows for the unambiguous genotyping of a minor component in the presence of a major component, where DIP–STR genotypes of the minor were successfully procured at ratios up to 1:1,000. The compound nature of this marker generates a high level of polymorphism that is suitable for identity testing. Here, we demonstrate the power of the DIP–STR approach on an initial set of nine markers surveyed in a Swiss population. Finally, we discuss the limitations and potential applications of our new system including preliminary tests on clinical samples and estimates of their performance on simulated DNA mixtures.
compound genetic marker; forensic; DNA microchimerism; diagnostics
A new test was recently developed that could use a high-density set of single nucleotide polymorphisms (SNPs) to determine whether a specific individual contributed to a mixture of DNA. The test statistic compared the genotype for the individual to the allele frequencies in the mixture and to the allele frequencies in a reference group. This test requires the ancestries of the reference group to be nearly identical to those of the contributors to the mixture. Here, we first quantify the bias, the increase in type I and type II error, when the ancestries are not well matched. Then, we show that the test can also be biased if the numbers of subjects in the two groups differ or if the platforms used to measure SNP intensities differ. We then introduce a new test statistic and a test that only requires the ancestries of the reference group to be similar to the individual of interest, and show that this test is not only robust to the number of subjects and platform, but also has increased power of detection. The two tests are compared on both HapMap and simulated data.
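The comparison described above can be illustrated with a short sketch. It assumes the per-SNP distance form usually attributed to the original test, D_j = |Y_j − Pop_j| − |Y_j − M_j|, where Y_j is the individual's allele frequency at SNP j (0, 0.5 or 1), M_j is the mixture frequency and Pop_j is the reference frequency, summarized across SNPs by a one-sample t statistic. The function and argument names are illustrative, and the new statistic introduced in the abstract is not reproduced here.

```python
import math


def distance_statistic(y, mix, ref):
    """One-sample t statistic of D_j = |y_j - ref_j| - |y_j - mix_j|.

    y   : individual's allele frequencies per SNP (0, 0.5 or 1)
    mix : allele frequencies in the mixture
    ref : allele frequencies in the reference group
    Positive values mean the individual sits closer to the mixture
    than to the reference.
    """
    if not (len(y) == len(mix) == len(ref)):
        raise ValueError("inputs must have equal length")
    d = [abs(yj - rj) - abs(yj - mj) for yj, mj, rj in zip(y, mix, ref)]
    n = len(d)
    if n < 2:
        raise ValueError("need at least two SNPs")
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # sample variance
    if var == 0:
        raise ValueError("zero variance across SNPs")
    return mean / math.sqrt(var / n)
```

Because the statistic depends on how well `ref` represents the ancestry of the mixture, a mismatched reference shifts the mean of D_j even for non-contributors, which is the bias the abstract quantifies.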
Mitochondrial DNA (mtDNA) variation is commonly analyzed in a wide range of different biomedical applications. Cases where more than one individual contributes to a stain genotyped from some biological material give rise to a mixture. Most forensic mixture cases are analyzed using autosomal markers. In rape cases, Y-chromosome markers typically add useful information. However, there are important cases where autosomal and Y-chromosome markers fail to provide useful profiles. In some instances, usually involving small amounts or degraded DNA, mtDNA may be the only useful genetic evidence available. Mitochondrial DNA mixtures also arise in studies dealing with the role of mtDNA variation in tumorigenesis. Such mixtures may be generated by the tumor, but they could also originate in vitro due to inadvertent contamination or a sample mix-up.
We present the statistical methods needed for mixture interpretation and emphasize the modifications required for the more well-known methods based on conventional markers to generalize to mtDNA mixtures. Two scenarios are considered. Firstly, only categorical mtDNA data is assumed available, that is, the variants contributing to the mixture. Secondly, quantitative data (peak heights or areas) on the allelic variants are also accessible. In cases where quantitative information is available in addition to allele designation, it is possible to extract more precise information by using regression models. More precisely, using quantitative information may lead to a unique solution in cases where the qualitative approach points to several possibilities. Importantly, these methods also apply to clinical cases where contamination is a potential alternative explanation for the data.
We argue that clinical and forensic scientists should give greater consideration to mtDNA for mixture interpretation. The results and examples show that the analysis of mtDNA mixtures contributes substantially to forensic casework and may also clarify erroneous claims made in clinical genetics regarding tumorigenesis.
Aggregate results from Genome-Wide Association Studies (GWAS) [1-3], such as genotype frequencies for cases and controls, were available until recently on public websites [4-5] because they were thought to reveal negligible information concerning an individual’s participation in a study. Homer et al. [6] suggested that a method for forensic detection of an individual’s contribution to an admixed DNA sample could be applied to aggregate GWAS data. Using a likelihood-based statistical framework, we develop an improved statistic that uses genotype frequencies and an individual’s genotypes to infer whether the individual or a close relative participated in the GWAS and, if so, the participant’s phenotype status. Our statistic compares the logarithm of genotype frequencies, in contrast to that of Homer et al. [6], which is based on differences in SNP probe intensity or allele frequencies. We derive the theoretical power of the test statistics and explore the empirical performance in scenarios with varying numbers of randomly chosen or top-associated SNPs.
We have developed a new method using the Qbead™ system for high-throughput genotyping of single nucleotide polymorphisms (SNPs). The Qbead system employs fluorescent Qdot™ semiconductor nanocrystals, also known as quantum dots, to encode microspheres that subsequently can be used as a platform for multiplexed assays. By combining mixtures of quantum dots with distinct emission wavelengths and intensities, unique spectral ‘barcodes’ are created that enable the high levels of multiplexing required for complex genetic analyses. Here, we applied the Qbead system to SNP genotyping by encoding microspheres conjugated to allele-specific oligonucleotides. After hybridization of oligonucleotides to amplicons produced by multiplexed PCR of genomic DNA, individual microspheres are analyzed by flow cytometry and each SNP is distinguished by its unique spectral barcode. Using 10 model SNPs, we validated the Qbead system as an accurate and reliable technique for multiplexed SNP genotyping. By modifying the types of probes conjugated to microspheres, the Qbead system can easily be adapted to other assay chemistries for SNP genotyping as well as to other applications such as analysis of gene expression and protein–protein interactions. With its capability for high-throughput automation, the Qbead system has the potential to be a robust and cost-effective platform for a number of applications.
Pooling genomic DNA samples within clinical classes of disease, followed by genotyping on whole-genome SNP microarrays, allows for rapid and inexpensive genome-wide association studies. Key to the success of these studies are the accuracy of the allelic frequency calculations, the ability to identify false-positives arising from assay variability and the ability to better resolve association signals through analysis of neighbouring SNPs.
We report the accuracy of allelic frequency measurements on pooled genomic DNA samples by comparing these measurements to the known allelic frequencies as determined by individual genotyping. We describe modifications to the calculation of k-correction factors from relative allele signal (RAS) values that remove biases and result in more accurate allelic frequency predictions. Our results show that the least accurate SNPs, those most likely to give false-positives in an association study, are identifiable by comparing their frequencies to both those from a known database of individual genotypes and those of the pooled replicates. In a disease with a previously identified genetic mutation, we demonstrate that one can identify the disease locus through the comparison of the predicted allelic frequencies in case and control pools. Furthermore, we demonstrate improved resolution of association signals using the mean of individual test-statistics for consecutive SNPs windowed across the genome. A database of k-correction factors for predicting allelic frequencies for each SNP, derived from several thousand individually genotyped samples, is provided. Lastly, a Perl script for calculating RAS values for the Affymetrix platform is provided.
Our results illustrate that pooling of DNA samples is an effective initial strategy to identify a genetic locus. However, it is important to eliminate inaccurate SNPs prior to analysis by comparing them to a database of individually genotyped samples as well as by comparing them to replicates of the pool. Lastly, detection of association signals can be improved by incorporating data from neighbouring SNPs.
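Two of the calculations mentioned above can be sketched briefly. The first assumes the commonly used form of k-correction, in which RAS = A / (A + B), k is the mean A/B signal ratio observed in heterozygotes, and the corrected frequency of allele A is RAS / (RAS + k(1 − RAS)); the modified calculation described by the authors is not reproduced here. The second is a plain sliding-window mean of per-SNP test statistics. All function names are illustrative.

```python
def k_corrected_frequency(ras, k):
    """Estimate the allele A frequency in a pool from its relative allele signal.

    ras : relative allele signal, A / (A + B), between 0 and 1
    k   : correction factor, assumed here to be the mean A/B signal
          ratio measured in individually genotyped heterozygotes
    """
    if not 0.0 <= ras <= 1.0:
        raise ValueError("ras must lie in [0, 1]")
    if k <= 0:
        raise ValueError("k must be positive")
    return ras / (ras + k * (1.0 - ras))


def windowed_mean(stats, width):
    """Mean of each run of `width` consecutive per-SNP test statistics."""
    if width < 1 or width > len(stats):
        raise ValueError("width must be between 1 and len(stats)")
    return [sum(stats[i:i + width]) / width
            for i in range(len(stats) - width + 1)]
```

For example, a SNP whose heterozygotes show an A/B ratio of 1.5 gives a raw RAS of 0.6 in a pool with a true frequency of 0.5, and the correction maps 0.6 back to 0.5.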
Motivations: High-throughput sequencing has made it possible to sequence DNA methylation of a whole genome at single-base resolution. A sample, however, may contain a number of distinct methylation patterns. For instance, cells of different types and in different developmental stages may have different methylation patterns. Alleles may be differentially methylated, which may partially explain why large portions of epigenomes from single cell types are partially methylated, and may have major effects on transcriptional output. Approaches relying on DNA sequence polymorphism to identify individual patterns from a mixture of heterogeneous epigenomes are insufficient as methylcytosines occur at a much higher density than SNPs.
Results: We have developed a mixture model-based approach for resolving distinct epigenomes from a heterogeneous sample. In particular, the model is applied to the detection of allele-specific methylation (ASM). The methods are tested on a synthetic methylome and applied to an Arabidopsis single root cell methylome.
Multiple investigators have established the feasibility of using buccal brush samples to genotype single nucleotide polymorphisms (SNPs) with high-density genome-wide microarrays, but there is currently no consensus on the accuracy of copy number variants (CNVs) inferred from these data. Regardless of the source of DNA, it is more difficult to detect CNVs than to genotype SNPs using these microarrays, and it therefore remains an open question whether buccal brush samples provide enough high-quality DNA for this purpose.
To demonstrate the quality of CNV calls generated from DNA extracted from buccal samples, compared to calls generated from blood samples, we evaluated the concordance of calls from individuals who provided both sample types. The Illumina Human660W-Quad BeadChip was used to determine SNPs and CNVs of 39 Arkansas participants in the National Birth Defects Prevention Study (NBDPS), including 16 mother-infant dyads, who provided both whole blood and buccal brush DNA samples.
We observed a 99.9% concordance rate of SNP calls in the 39 blood–buccal pairs. From the same dataset, we performed a similar analysis of CNVs. Each of the 78 samples was independently segmented into regions of like copy number using the Optimal Segmentation algorithm of Golden Helix SNP & Variation Suite 7.
Across 640,663 loci on 22 autosomal chromosomes, segment-mean log R ratios had an average correlation of 0.899 between blood-buccal pairs of samples from the same individual, while the average correlation between all possible blood-buccal pairs of samples from unrelated individuals was 0.318. An independent analysis using the QuantiSNP algorithm produced average correlations of 0.943 between blood-buccal pairs from the same individual versus 0.332 between samples from unrelated individuals.
Segment-mean log R ratios had an average correlation of 0.539 between mother-offspring dyads of buccal samples, which was not statistically significantly different from the average correlation of 0.526 between mother-offspring dyads of blood samples (p=0.302).
We observed performance from the subject-collected mail-in buccal brush samples comparable to that of blood. These results show that such DNA samples can be used for genome-wide scans of both SNPs and CNVs, and that high rates of CNV concordance were achieved whether using a change-point-based algorithm or one based on a hidden Markov model (HMM).
SNPs, Single nucleotide polymorphisms; CNVs, Copy number variants; NBDPS, National Birth Defects Prevention Study; Buccal brush
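The concordance measure reported above is a correlation between the segment-mean log R ratios of two samples across loci. A minimal sketch follows, assuming a Pearson correlation over two equal-length lists of per-locus values; the abstract does not state which correlation coefficient was used, and the function name is illustrative.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences of
    per-locus segment-mean log R ratios (e.g. blood vs. buccal)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)
```

Averaging this value over all same-individual pairs, and separately over all unrelated pairs, gives the two kinds of summary figures contrasted in the abstract.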
Forensic analysis of low template (LT) DNA mixtures is particularly complicated when (1) LT components concur with high template components, (2) more than three contributors are present, or (3) contributors are related. In this study, we generated a set of such complex LT mixtures and examined two methods to assist in DNA profile analysis and interpretation: the “n/2” consensus method (Benschop et al. 2011) and the pool profile approach. N/2 consensus profiles include alleles that are reproducibly amplified in at least half of the replications. Pool profiles are generated by injecting a blend of independently amplified PCR products on a capillary electrophoresis instrument. Both approaches resulted in a similar increase in the percentage of detected alleles compared to individual profiles, and both rarely included drop-in alleles when mixtures of pristine DNAs were used. Interestingly, the consensus and the pool profiles often showed differences for the actual alleles detected for the LT component(s). We estimated the number of contributors using different methods. Better approximations were obtained with data in the consensus and pool profiles compared to the data of the individual profiles. Consensus profiles contain allele calls only, while pool profiles consist of both allele calls and peak height information, which can be of use in (statistical) profile analysis. All advantages and limitations of the various types of profiles were assessed, and based on the results we infer that both consensus and pool profiles (or a combination thereof) are helpful in the interpretation of complex LT DNA mixtures.
Electronic supplementary material
The online version of this article (doi:10.1007/s00414-011-0647-5) contains supplementary material, which is available to authorized users.
Forensic science; Complex mixtures; Low template STR typing; Consensus method; Pool; Profile; Next Generation Multiplex (NGM)
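The n/2 consensus rule described in the abstract above, which keeps alleles reproducibly amplified in at least half of the replicates, can be sketched as follows. The data layout (one set of allele labels per replicate at a single locus) and the function name are assumptions made for illustration.

```python
from collections import Counter


def consensus_profile(replicates):
    """Return the alleles seen in at least half of the replicate amplifications.

    replicates : list of collections of allele labels, one per replicate,
                 all for the same locus (hypothetical layout)
    """
    if not replicates:
        raise ValueError("at least one replicate is required")
    # Count each allele once per replicate in which it appears.
    counts = Counter(allele for rep in replicates for allele in set(rep))
    threshold = len(replicates) / 2
    return {allele for allele, c in counts.items() if c >= threshold}


# Four replicates at one locus (made-up allele calls).
reps = [{"12", "14"}, {"12"}, {"12", "15"}, {"14", "16"}]
consensus = consensus_profile(reps)  # alleles "12" (3 of 4) and "14" (2 of 4)
```

Alleles "15" and "16" each appear in only one of four replicates and are dropped, which is how the rule limits the inclusion of drop-in alleles.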
Single nucleotide polymorphisms (SNPs) are indispensable in such applications as association mapping and construction of high-density genetic maps. These applications usually require genotyping of thousands of SNPs in a large number of individuals. Although a number of SNP genotyping assays are available, most of them are designed for SNP genotyping in diploid individuals. Here, we demonstrate that the Illumina GoldenGate assay could be used for SNP genotyping of homozygous tetraploid and hexaploid wheat lines. Genotyping reactions could be carried out directly on genomic DNA without the necessity of preliminary PCR amplification. A total of 53 tetraploid and 38 hexaploid homozygous wheat lines were genotyped at 96 SNP loci. The genotyping error rate estimated after removal of low-quality data was 0 and 1% for tetraploid and hexaploid wheat, respectively. The SNP genotyping assays developed here were shown to be useful for genotyping wheat cultivars. This study demonstrated that the GoldenGate assay is a very efficient tool for high-throughput genotyping of polyploid wheat, opening new possibilities for the analysis of genetic variation in wheat and dissection of the genetic basis of complex traits using an association mapping approach.
Electronic supplementary material
The online version of this article (doi:10.1007/s00122-009-1059-5) contains supplementary material, which is available to authorized users.
Odors are rarely composed of a single compound, but rather contain a large and complex variety of chemical components. Often, these mixtures are perceived as having unique qualities that can be quite different from the combination of their components. In many cases, a majority of the components of a mixture cannot be individually identified. This synthetic processing of odor information suggests that individual component representations of the mixture must interact somewhere along the olfactory pathway. The anatomical nature of sensory neuron input into segregated glomeruli within the bulb suggests that initial input of odor information into the bulb is analytic. However, a large network of interneurons within the olfactory bulb could allow for mixture interactions via mechanisms such as lateral inhibition. Currently in mammals, it is unclear if postsynaptic mitral/tufted cell glomerular mixture responses reflect the analytical mixture input, or provide the initial basis for synthetic processing within the olfactory system. To address this, olfactory bulb glomerular binary mixture representations were compared to representations of each component using transgenic mice expressing the calcium indicator G-CaMP2 in olfactory bulb mitral/tufted cells. Overall, dorsal surface mixture representations showed little mixture interaction and often appeared as a simple combination of the component representations. Based on this, it is concluded that dorsal surface glomerular mixture representations remain largely analytical with nearly all component information preserved.
The success of genome-wide association studies has paralleled the development of efficient genotyping technologies. We describe the development of a next-generation microarray based on the new highly-efficient Affymetrix Axiom genotyping technology that we are using to genotype individuals of European ancestry from the Kaiser Permanente Research Program on Genes, Environment and Health (RPGEH). The array contains 674,517 SNPs, and provides excellent genome-wide as well as gene-based and candidate-SNP coverage. Coverage was calculated using an approach based on imputation and cross validation. Preliminary results for the first 80,301 saliva-derived DNA samples from the RPGEH demonstrate very high quality genotypes, with sample success rates above 94% and over 98% of successful samples having SNP call rates exceeding 98%. At steady state, we have produced 462 million genotypes per week for each Axiom system. The new array provides a valuable addition to the repertoire of tools for large scale genome-wide association studies.
Microarray; Genome-wide association study; Coverage; Throughput; Single nucleotide polymorphism
Reduced mobilities, resolving powers and detection limits for 12 ribonucleotides and 4 ribonucleosides were measured by ambient pressure electrospray ionization ion mobility spectrometry (ESI-IMS). With the instrument used in this study it was possible to separate some of these compounds within the mixtures. In addition, the detection limits reported for the ribonucleotides and ribonucleosides ranged from 15 to 300 picomoles whereas resolving power ranged from 41 to 56, suggesting that ambient pressure ESI-IMS may be used for their rapid and sensitive separation and detection. This short report demonstrates that it was possible to use IMS for the separation of nucleotides and nucleosides in less than one second. The application holds great promise for nucleotide analysis in the area of separating DNA fragments in genome sequencing and also for forensic DNA typing examinations used for the identification of blood stains at crime scenes and paternity testing.
Electrospray ionization; Ion mobility spectrometry; Nucleotides; Detection limit; Resolving power
Francisella tularensis contains several highly pathogenic subspecies, including Francisella tularensis subsp. holarctica, whose distribution is circumpolar in the northern hemisphere. The phylogeography of these subspecies and their subclades was examined using whole-genome single nucleotide polymorphism (SNP) analysis, high-density microarray SNP genotyping, and real-time-PCR-based canonical SNP (canSNP) assays. Almost 30,000 SNPs were identified among 13 whole genomes for phylogenetic analysis. We selected 1,655 SNPs to genotype 95 isolates on a high-density microarray platform. Finally, 23 clade- and subclade-specific canSNPs were identified and used to genotype 496 isolates to establish global geographic genetic patterns. We confirm previous findings concerning the four subspecies and two Francisella tularensis subsp. tularensis subpopulations and identify additional structure within these groups. We identify 11 subclades within F. tularensis subsp. holarctica, including a new, genetically distinct subclade that appears intermediate between Japanese F. tularensis subsp. holarctica isolates and the common F. tularensis subsp. holarctica isolates associated with the radiation event (the B radiation) wherein this subspecies spread throughout the northern hemisphere. Phylogenetic analyses suggest a North American origin for this B-radiation clade and multiple dispersal events between North America and Eurasia. These findings indicate a complex transmission history for F. tularensis subsp. holarctica.
Following the completion of reverse transcription, the human immunodeficiency virus integrase (IN) enzyme covalently links the viral cDNA to a host cell chromosome. An IN multimer carries out this reaction, but the roles of individual monomers within the complex are mostly unknown. Here we analyzed the distribution of functions for target DNA capture and catalysis within the IN multimer. We used forced complementation between pairs of IN deletion derivatives in vitro as a tool for probing cis-trans relationships and analyzed amino acid substitutions affecting either catalysis or target site selection within these complementing complexes. This allowed the demonstration that the IN variant contributing the active catalytic domain was also responsible for recognition of the integration target DNA. We were further able to establish that a single monomer is responsible for both functions by use of assay mixtures containing three different IN genotypes. These data specify the ligands bound at the catalytically relevant IN monomer and allow more-specific modeling of the mechanism of inhibitors that also bind this surface of IN.
Many lines of evidence suggest that mitochondrial DNA (mtDNA) variants are involved in the pathogenesis of human complex diseases, especially age-related disorders. Osteoporosis is a typical age-related complex disease. However, the role of mtDNA variants in susceptibility to osteoporosis is largely unknown. In this study, we performed a mitochondria-wide association study for osteoporosis in Caucasians. A total of 445 mitochondrial single nucleotide polymorphisms (mtSNPs) were genotyped in a large sample of 2,286 unrelated Caucasian subjects by using the Affymetrix Genome-Wide SNP Array 6.0, and 72 mtSNPs survived the quality control. We first tested for association between single mtSNPs and bone mineral density (BMD), and found that a mtSNP within the NADH dehydrogenase 2 gene (ND2), the mt4823 C/A polymorphism, was strongly associated with hip BMD (P = 2.05 × 10^−4), even after conservative Bonferroni correction. The C allele of mt4823 was associated with reduced hip BMD and the effect size (β) was estimated to be ~0.044. Another SNP, mt15885, within the Cytochrome b gene (Cytb) was found to be associated with both spine (P = 1.66 × 10^−3) and hip BMD (P = 0.023). The T allele of mt15885 had a protective effect on spine (β = 0.064) and hip BMD (β = 0.038). Next, we classified subjects into the nine common European haplogroups and conducted association analyses. Subjects classified as haplogroup X had significantly lower mean hip BMD values than others (P = 0.040). Our results highlight the importance of mtDNA variants in influencing BMD variation and risk of osteoporosis.
mtSNP; haplogroup; osteoporosis; BMD; association
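As a rough check of the Bonferroni statement in the abstract above, the following sketch assumes a family-wise significance level of 0.05 divided across the 72 mtSNPs that passed quality control. The abstract does not state the significance level or the exact number of tests used for the correction, so both values are assumptions.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


ALPHA = 0.05    # assumed family-wise level
N_TESTS = 72    # assumed: mtSNPs surviving quality control

threshold = bonferroni_threshold(ALPHA, N_TESTS)  # about 6.9e-4
p_mt4823_hip = 2.05e-4                            # reported in the abstract
passes = p_mt4823_hip < threshold
```

Under these assumptions the reported hip BMD P value for mt4823 falls below the corrected threshold, consistent with the abstract's claim.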
To isolate mucosal cells of the perpetrator in a sexual assault case from a complex mixture of his mucosal cells and the victim’s skin by micromanipulation prior to genomic analysis.
To capture and analyze mucosal cells, we used micromanipulation with on-chip low volume polymerase chain reaction (LV-PCR). Consensus DNA profiles were generated from 5 replicate experiments.
Results and conclusions
We validated the use of micromanipulation with on-chip LV-PCR for genomic analysis of complex biological mixtures in a fatal rape case. The perpetrator’s mucosal cells were captured from nipple swabs of the victim, and a single-source DNA profile was generated from cell mixtures. These data suggest that micromanipulation with on-chip LV-PCR is an effective forensic tool for the analysis of specific cells from complex samples.
High-density genotyping arrays that measure hybridization of genomic DNA fragments to allele-specific oligonucleotide probes are widely used to genotype single nucleotide polymorphisms (SNPs) in genetic studies, including human genome-wide association studies. Hybridization intensities are converted to genotype calls by clustering algorithms that assign each sample to a genotype class at each SNP. Data for SNP probes that do not conform to the expected pattern of clustering are often discarded, contributing to ascertainment bias and resulting in lost information, as much as 50% in a recent genome-wide association study in dogs.
We identified atypical patterns of hybridization intensities that were highly reproducible and demonstrated that these patterns represent genetic variants that were not accounted for in the design of the array platform. We characterized variable intensity oligonucleotide (VINO) probes that display such patterns and are found in all hybridization-based genotyping platforms, including those developed for human, dog, cattle, and mouse. When recognized and properly interpreted, VINOs recovered a substantial fraction of discarded probes and counteracted SNP ascertainment bias. We developed software (MouseDivGeno) that identifies VINOs and improves the accuracy of genotype calling. MouseDivGeno produced highly concordant genotype calls when compared with other methods but it uniquely identified more than 786,000 VINOs in 351 mouse samples. We used whole-genome sequence from 14 mouse strains to confirm the presence of novel variants explaining 28,000 VINOs in those strains. We also identified VINOs in human HapMap 3 samples, many of which were specific to an African population. Incorporating VINOs in phylogenetic analyses substantially improved the accuracy of a Mus species tree and local haplotype assignment in laboratory mouse strains.
The problems of ascertainment bias and missing information due to genotyping errors are widely recognized as limiting factors in genetic studies. We have conducted the first formal analysis of the effect of novel variants on genotyping arrays, and we have shown that these variants account for a large portion of miscalled and uncalled genotypes. Genetic studies will benefit from substantial improvements in the accuracy of their results by incorporating VINOs in their analyses.
The identification of copy number aberration in the human genome is an important area in cancer research. We develop a model for determining genomic copy numbers using high-density single nucleotide polymorphism genotyping microarrays. The method is based on a Bayesian spatial normal mixture model with an unknown number of components corresponding to true copy numbers. A reversible jump Markov chain Monte Carlo algorithm is used to implement the model and perform posterior inference.
The performance of the algorithm is examined on both simulated and real cancer data, and it is compared with the popular CNAG algorithm for copy number detection.
We demonstrate that our Bayesian mixture model performs at least as well as the hidden Markov model based CNAG algorithm and in certain cases does better. One of the added advantages of our method is the flexibility of modeling normal cell contamination in tumor samples.
We report an attempt to extend the previously successful approach of combining SNP (single nucleotide polymorphism) microarrays and DNA pooling (SNP-MaP) to high-density microarrays. Whereas earlier studies employed a range of Affymetrix SNP microarrays comprising 10 K to 500 K SNPs, this most recent investigation used the 6.0 chip, which displays 906,600 SNP probes and 946,000 probes for the interrogation of CNVs (copy number variations). The genotyping assay using the Affymetrix SNP 6.0 array is highly demanding of sample quality because of the small feature size, low redundancy, and lack of mismatch probes.
In the first study published so far using this microarray on pooled DNA, we found that pooled cheek swab DNA could not accurately predict the real allele frequencies of the samples that comprised the pools. In contrast, the allele frequency estimates from blood DNA pools were reasonable, although inferior to those obtained with previously employed Affymetrix microarrays. However, better analysis methods might improve performance.
Despite the decreasing costs of genome-wide individual genotyping, the pooling approach may have applications in very large-scale case-control association studies. In such cases, our study suggests that high-quality DNA preparations and lower density platforms should be preferred.
Single nucleotide polymorphisms (SNPs) are growing in popularity as genetic markers for investigating evolutionary processes. A panel of SNPs is often developed by comparing large quantities of DNA sequence data across multiple individuals to identify polymorphic sites. For non-model species, this is particularly difficult, as performing the necessary large-scale genomic sequencing often exceeds the resources available for the project. In this study, we trial the Bovine SNP50 BeadChip developed in cattle (Bos taurus) for identifying polymorphic SNPs in the cervids Odocoileus hemionus (mule deer and black-tailed deer) and O. virginianus (white-tailed deer) in the Pacific Northwest. We found that 38.7% of loci could be genotyped, of which 5% (n = 1068) were polymorphic. Of these 1068 polymorphic SNPs, a mixture of putatively neutral loci (n = 878) and loci under selection (n = 190) was identified with the FST-outlier method. A range of population genetic analyses was implemented using these SNPs and a panel of 10 microsatellite loci. The three types of deer could readily be distinguished with both the SNP and microsatellite datasets. This study demonstrates that commercially developed SNP chips are a viable means of SNP discovery for non-model organisms, even when used between very distantly related species (the Bovidae and Cervidae families diverged some 25.1–30.1 million years before present).
The recent discovery of widespread copy number variation in humans has forced a shift away from the assumption of two copies per locus per cell throughout the autosomal genome. In particular, a SNP site can no longer always be accurately assigned one of three genotypes in an individual. In the presence of copy number variability, the individual may theoretically harbor any number of copies of each of the two SNP alleles.
To address this issue, we have developed a method to infer a "generalized genotype" from raw SNP microarray data. Here we apply our approach to data from 48 individuals and uncover thousands of aberrant SNPs, most in regions that were previously unreported as copy number variants. We show that our allele-specific copy numbers follow Mendelian inheritance patterns that would be obscured in the absence of SNP allele information. The interplay between duplication and point mutation in our data sheds light on the relative frequencies of these events in human history, showing that at least some of the duplication events were recurrent.
This new multi-allelic view of SNPs has a complicated role in disease association studies, and further work will be necessary in order to accurately assess its importance. Software to perform generalized genotyping from SNP array data is freely available online.
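The idea of a "generalized genotype" can be illustrated by enumerating the allele-specific copy number pairs possible at a biallelic SNP for a given total copy number. This is a minimal sketch of the genotype space only, not the inference method described above; the function names and the "-" label for a homozygous deletion are hypothetical.

```python
def generalized_genotypes(total_copies):
    """Return all (A copies, B copies) pairs that sum to total_copies."""
    if total_copies < 0:
        raise ValueError("total_copies must be non-negative")
    return [(a, total_copies - a) for a in range(total_copies, -1, -1)]


def genotype_label(a_copies, b_copies):
    """Render a pair such as (2, 1) as 'AAB'; '-' marks zero copies."""
    label = "A" * a_copies + "B" * b_copies
    return label if label else "-"


if __name__ == "__main__":
    # Two copies give the three classical genotypes; other totals give more or fewer.
    for total in range(0, 5):
        labels = [genotype_label(a, b) for a, b in generalized_genotypes(total)]
        print(total, labels)
```

With two copies the enumeration yields the three classical genotypes (AA, AB, BB), while a total of c copies yields c + 1 possible generalized genotypes.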
Forensic and ancient DNA (aDNA) extracts are mixtures of endogenous aDNA, existing in a more or less damaged state, and contaminant DNA. To obtain the true aDNA sequence, it is not sufficient to generate a single direct sequence of the mixture, even where the authentic aDNA is the most abundant component (e.g. 25% or more) of the mixture. Only bacterial cloning can elucidate the components of this mixture. We calculate the number of clones that need to be sampled (for various mixture ratios) in order to be confident (at various levels of confidence) of having identified the major component. We demonstrate that to be >95% confident of identifying the most abundant sequence present at 70% in the ancient sample, 20 clones must be sampled. We make recommendations and offer a free-access web-based program, which constructs the most reliable consensus sequence from the user's input clone sequences and analyses the confidence limits for each nucleotide position and for the whole consensus sequence. Accepted authentication methods must be employed in order to assess the authenticity and endogeneity of the resulting consensus sequences (e.g. quantification and replication by another laboratory, blind testing, amelogenin sex versus morphological sex, the effective use of controls, etc.) and determine whether they are indeed aDNA.
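The clone-sampling argument can be sketched with a simple binomial calculation. The example below assumes that clones are sampled independently and that the major component is "identified" when it appears in strictly more than half of the sampled clones. The criterion actually used in the study may differ, and the function name is hypothetical.

```python
from math import comb


def prob_major_is_majority(n_clones, major_fraction):
    """P(the major sequence occurs in more than half of n_clones clones).

    Assumption: each clone independently carries the major sequence with
    probability major_fraction (a binomial sampling model).
    """
    if n_clones < 1 or not 0.0 <= major_fraction <= 1.0:
        raise ValueError("need n_clones >= 1 and major_fraction in [0, 1]")
    threshold = n_clones // 2 + 1  # smallest count that is a strict majority
    return sum(
        comb(n_clones, k)
        * major_fraction ** k
        * (1.0 - major_fraction) ** (n_clones - k)
        for k in range(threshold, n_clones + 1)
    )


if __name__ == "__main__":
    for fraction in (0.6, 0.7, 0.8):
        p = prob_major_is_majority(20, fraction)
        print(f"major fraction {fraction}: P(majority among 20 clones) = {p:.3f}")
```

Under this assumed criterion, 20 clones with a 70% major component give a probability just above 0.95, in line with the figure quoted above, although this agreement should not be read as a reproduction of the study's calculation.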