The tomato (Solanum lycopersicum L.) is a model plant for genome research in the Solanaceae, as well as for studying crop breeding. Genome-wide single nucleotide polymorphisms (SNPs) are a valuable resource in genetic research and breeding. However, most methods for discovering genome-wide SNPs require expensive high-depth sequencing. Here, we describe a method for SNP calling using a modified version of SAMtools with improved sensitivity. We analyzed 90 Gb of raw next-generation sequencing data from two resequencing and seven transcriptome data sets from several tomato accessions, and identified 4,812,432 non-redundant SNPs. Moreover, the SNP calling workflow was improved by aligning the reference genome against its own raw sequence data. Using this approach, 131,785 SNPs were discovered from the transcriptome data of seven accessions. In addition, 4,680,647 SNPs were identified from the genome of S. pimpinellifolium, about 60 times more than the 71,637 SNPs from the PI212816 transcriptome. The SNP distributions were compared between the whole genome and the transcriptome of S. pimpinellifolium, and we surveyed the locations of SNPs within genic and intergenic regions. Our results indicate that this abundant set of genome-wide SNP markers, together with the highly sensitive SNP calling method, enables marker-assisted breeding and genome-wide association studies.
next generation sequencing (NGS); single nucleotide polymorphism (SNP); tomato
Genome-wide association studies (GWAS) aim to identify genetic variants (usually single nucleotide polymorphisms [SNPs]) across the entire human genome that are associated with phenotypic traits such as disease status and drug response. Highly accurate and reproducible genotype calling is paramount, since errors introduced by calling algorithms can lead to inflation of false associations between genotype and phenotype. Most genotype calling algorithms currently used for GWAS are based on multiple arrays. Because hundreds of gigabytes (GB) of raw data are generated from a GWAS, the samples are typically partitioned into batches containing subsets of the entire dataset for genotype calling. High call rates and accuracies have been achieved. However, the effects of batch size (i.e., the number of chips analyzed together) and of batch composition (i.e., the choice of chips in a batch) on call rate and accuracy, as well as the propagation of these effects into the significantly associated SNPs identified, have not been investigated. In this paper, we analyzed the effects of both batch size and batch composition on the genotype calling algorithm BRLMM using raw data from 270 HapMap samples analyzed with the Affymetrix Human Mapping 500K array set.
Using data from 270 HapMap samples interrogated with the Affymetrix Human Mapping 500K array set, three different batch sizes and three different batch compositions were used for genotyping with the BRLMM algorithm. Comparative analysis of the calling results and the corresponding lists of significant SNPs identified through association analysis revealed that both batch size and composition affected the genotype calling results and the significantly associated SNPs. Batch size and batch composition effects were more severe for samples and SNPs with lower call rates than for those with higher call rates, and for heterozygous genotype calls compared to homozygous ones.
Batch size and composition affect the genotype calling results in GWAS using BRLMM. The larger the differences in batch sizes, the larger the effect. The more homogenous the samples in the batches, the more consistent the genotype calls. The inconsistency propagates to the lists of significantly associated SNPs identified in downstream association analysis. Thus, uniform and large batch sizes should be used to make genotype calls for GWAS. In addition, samples of high homogeneity should be placed into the same batch.
Summary: Genotype calling from high-throughput platforms such as Illumina and Affymetrix is a critical step in data processing, so that accurate information on genetic variants can be obtained for phenotype–genotype association studies. A number of algorithms have been developed to infer genotypes from data generated through the Illumina BeadStation platform, including GenCall, GenoSNP, Illuminus and CRLMM. Most of these algorithms are built on population-based statistical models that genotype each SNP in turn, such as GenCall with the GenTrain clustering algorithm, and require a large reference population to perform well. These approaches may not work well for rare variants, where only a small proportion of the individuals carry the variant. A fundamentally different approach, implemented in GenoSNP, adopts a single nucleotide polymorphism (SNP)-based model to infer the genotypes of all the SNPs in one individual, making it an appealing alternative for calling rare variants. However, compared to the population-based strategies, more SNPs in GenoSNP may fail the Hardy–Weinberg Equilibrium test. To take advantage of both strategies, we propose a two-stage SNP calling procedure, named the modified mixture model (M3), to improve call accuracy for both common and rare variants. The effectiveness of our approach is demonstrated through applications to genotype calling on a set of HapMap samples used for quality control purposes in a large case–control study of cocaine dependence. The increase in power with M3 is greater for rare variants than for common variants, depending on the model.
Availability: M3 algorithm: http://bioinformatics.med.yale.edu/group.
Contact: firstname.lastname@example.org; email@example.com
Supplementary information: Supplementary data are available at Bioinformatics online.
Illumina's Infinium SNP BeadChips are extensively used in both small and large-scale genetic studies. A fundamental step in any analysis is the processing of raw allele A and allele B intensities from each SNP into genotype calls (AA, AB, BB). Various algorithms which make use of different statistical models are available for this task. We compare four methods (GenCall, Illuminus, GenoSNP and CRLMM) on data where the true genotypes are known in advance and data from a recently published genome-wide association study.
In general, differences in accuracy between the evaluated methods are relatively small, although CRLMM and GenoSNP were found to consistently outperform GenCall. The performance of Illuminus is heavily dependent on sample size, with lower no-call rates and improved accuracy as the number of available samples increases. For X chromosome SNPs, methods with sex-dependent models (Illuminus, CRLMM) perform better than methods that ignore gender information (GenCall, GenoSNP). We observe that CRLMM and GenoSNP are more accurate at calling SNPs with low minor allele frequency than GenCall or Illuminus. The sample quality metrics from the four methods were found to agree closely in flagging samples with unusual signal characteristics.
CRLMM, GenoSNP and GenCall can be applied with confidence in studies of any size, as their performance was shown to be invariant to the number of samples available. Illuminus on the other hand requires a larger number of samples to achieve comparable levels of accuracy and its use in smaller studies (50 or fewer individuals) is not recommended.
We present GStream, a method that combines genome-wide SNP and CNV genotyping in the Illumina microarray platform with unprecedented accuracy. This new method outperforms previous well-established SNP genotyping software. More importantly, the CNV calling algorithm of GStream dramatically improves the results obtained by previous state-of-the-art methods and yields an accuracy that is close to that obtained by purely CNV-oriented technologies like Comparative Genomic Hybridization (CGH). We demonstrate the superior performance of GStream using microarray data generated from HapMap samples. Using the reference CNV calls generated by the 1000 Genomes Project (1KGP) and well-known studies on whole genome CNV characterization based either on CGH or genotyping microarray technologies, we show that GStream can increase the number of reliably detected variants up to 25% compared to previously developed methods. Furthermore, the increased genome coverage provided by GStream allows the discovery of CNVs in close linkage disequilibrium with SNPs, previously associated with disease risk in published Genome-Wide Association Studies (GWAS). These results could provide important insights into the biological mechanism underlying the detected disease risk association. With GStream, large-scale GWAS will not only benefit from the combined genotyping of SNPs and CNVs at an unprecedented accuracy, but will also take advantage of the computational efficiency of the method.
High throughput microarray-based single nucleotide polymorphism (SNP) genotyping has revolutionized the way genome-wide linkage scans and association analyses are performed. One of the key features of the array-based GeneChip® Mapping 10K Array from Affymetrix is the automated SNP calling algorithm. The Affymetrix algorithm was trained on a database of ethnically diverse DNA samples to create SNP call zones that are used as static models to make genotype calls for experimental data. We describe here the implementation of clustering algorithms on large training datasets resulting in improved SNP call rates on the 10K GeneChip.
A database of 948 individuals genotyped on the GeneChip® Mapping 10K 2.0 Array was used to identify 822 SNPs that were called consistently less than 75% of the time. These SNPs represent on average 8.25% of the total SNPs on each chromosome with chromosome 19, the most gene-rich chromosome, containing the highest proportion of poor performers (18.7%). To remedy this, we created SNiPer, a new application which uses two clustering algorithms to yield increased call rates and equivalent concordance to Affymetrix called genotypes. We include a training set for these algorithms based on individual genotypes for 705 samples. SNiPer has the capability to be retrained for lab-specific training sets. SNiPer is freely available for download at .
The correct calling of poor performing SNPs may prove to be key in future linkage studies performed on the 10K GeneChip. It would prove particularly invaluable for those diseases that map to chromosome 19, known to contain a high proportion of poorly performing SNPs. Our results illustrate that SNiPer can be used to increase call rates on the 10K GeneChip® without sacrificing accuracy, thereby increasing the amount of valid data generated.
Single nucleotide polymorphism (SNP) and copy number variation (CNV) are both widespread characteristics of the human genome, yet they are often called separately on common genotyping platforms. To capture integrated SNP and CNV information, methods have been developed for calling allele-specific copy numbers, or so-called copy number polymorphisms (CNPs), using limited inter-marker correlation. In this paper, we propose a haplotype-based maximum likelihood method to call CNPs, which takes advantage of the valuable multi-locus linkage disequilibrium (LD) information in the population. We also developed a computationally efficient algorithm to estimate haplotype frequencies and optimize individual CNP calls iteratively, even in the presence of missing data. Through simulations, we demonstrate that our model is more sensitive and accurate in detecting various CNV regions than commonly used CNV calling methods, including PennCNV, another hidden Markov model (HMM) using CNPs, a scan statistic, segCNV, and cnvHap. Our method often performs better in regions with higher LD, in longer CNV regions, and for common CNVs than in the opposite settings. We applied our method to the genotypes of 90 HapMap CEU samples and 23 patients with acute lung injury (ALI); for each ALI patient, genotyping was performed twice. The CNPs from our method show good consistency, with accuracy comparable to that of other methods.
CNV; CNP; GWAS; haplotype; joint SNP and CNV calling; integrated SNP and CNV
Recent technology advances have enabled sequencing of individual genomes, promising to revolutionize biomedical research. However, deep sequencing remains more expensive than microarrays for performing whole-genome SNP genotyping.
In this paper, we introduce a new multi-locus statistical model and computationally efficient genotype calling algorithms that integrate shotgun sequencing data with linkage disequilibrium (LD) information extracted from reference population panels such as HapMap or the 1000 Genomes Project. Experiments on publicly available 454, Illumina, and ABI SOLiD sequencing datasets suggest that integrating LD information yields genotype calling accuracy comparable to that of microarray platforms from low-coverage sequencing data. A software package implementing our algorithm, released under the GNU General Public License, is available at http://dna.engr.uconn.edu/software/GeneSeq/.
Integration of LD information leads to significant improvements in genotype calling accuracy compared to prior LD-oblivious methods, rendering low-coverage sequencing as a viable alternative to microarrays for conducting large-scale genome-wide association studies.
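The core of such LD-aware calling can be illustrated with a simplified single-site sketch: per-genotype sequencing read likelihoods are combined with a Hardy–Weinberg prior built from a reference-panel allele frequency. The function name and genotype encoding here are assumptions for illustration; the published method conditions on multi-locus haplotypes rather than a marginal frequency.

```python
def genotype_posterior(likelihoods, b_allele_freq):
    """Combine sequencing likelihoods P(reads | genotype) with a
    population-based prior (single-site simplification of LD integration).

    likelihoods: dict mapping "AA"/"AB"/"BB" to read likelihoods.
    b_allele_freq: B-allele frequency from a reference panel.
    """
    p = b_allele_freq
    # Hardy-Weinberg prior from the reference-panel allele frequency
    prior = {"AA": (1 - p) ** 2, "AB": 2 * p * (1 - p), "BB": p ** 2}
    unnorm = {g: likelihoods[g] * prior[g] for g in prior}
    total = sum(unnorm.values())
    return {g: v / total for g, v in unnorm.items()}
```

With ambiguous low-coverage likelihoods, the population prior pulls the call toward the genotype that is plausible given the reference panel, which is why low-coverage data can approach microarray accuracy.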
The growing accessibility of genomic resources through next-generation sequencing (NGS) technologies has revolutionized the application of molecular genetic tools to ecology and evolutionary studies in non-model organisms. Here we present a case study of the European hake (Merluccius merluccius), one of the most important demersal resources of European fisheries. Two sequencing platforms, the Roche 454 FLX (454) and the Illumina Genome Analyzer (GAII), were used for Single Nucleotide Polymorphism (SNP) discovery in the hake muscle transcriptome. De novo transcriptome assembly into unique contigs, annotation, and in silico SNP detection were carried out in parallel for the 454 and GAII sequence data. High-throughput genotyping with the Illumina GoldenGate assay was performed to validate 1,536 putative SNPs. The validation results were analysed to compare the performance of the 454 and GAII methods and to evaluate the role of several variables (e.g. sequencing depth, intron-exon structure, sequence quality and annotation). Despite well-known differences in sequence length and throughput, the two approaches showed similar assay conversion rates (approximately 43%) and percentages of polymorphic loci (67.5% and 63.3% for GAII and 454, respectively). Both NGS platforms therefore proved suitable for large-scale identification of SNPs in transcribed regions of non-model species, although the lack of a reference genome profoundly affects the genotyping success rate. The overall efficiency, however, can be improved by using strict quality and filtering criteria for SNP selection (sequence quality, intron-exon structure, target region score).
Motivation: Most genotyping technologies for single nucleotide polymorphism (SNP) markers use standard clustering methods to ‘call’ the SNP genotypes. These methods are not always optimal in distinguishing the genotype clusters of a SNP because they do not take advantage of specific features of the genotype calling problem. In particular, when family data are available, pedigree information is ignored. Furthermore, prior information about the distribution of the measurements for each cluster can be used to choose an appropriate model-based clustering method and can significantly improve the genotype calls. One special genotyping problem that has never been discussed in the literature is that of genotyping of trisomic individuals, such as individuals with Down syndrome. Calling trisomic genotypes is a more complicated problem, and the addition of external information becomes very important.
Results: In this article, we discuss the impact of incorporating external information into clustering algorithms to call the genotypes for both disomic and trisomic data. We also propose two new methods to call genotypes using family data. One is a modification of the K-means method and uses the pedigree information by updating all members of a family together. The other is a likelihood-based method that combines the Gaussian or beta-mixture model with pedigree information. We compare the performance of these two methods and some other existing methods using simulation studies. We also compare the performance of these methods on a real dataset generated by the Illumina platform (www.illumina.com).
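As a rough illustration of the "update all members of a family together" idea, a nearest-center genotype assignment can be constrained so that the child's call is Mendelian-consistent with the parents' calls. This is a sketch, not the SNPCaller implementation; the 1-D intensity encoding, function names, and fixed cluster centers are assumptions.

```python
def nearest_genotype(intensity, centers):
    """K-means-style assignment step: pick the genotype whose cluster
    center is closest to the normalized B-allele intensity in [0, 1]."""
    return min(centers, key=lambda g: abs(intensity - centers[g]))

# Child genotypes consistent with each unordered parental pair
MENDELIAN = {
    ("AA", "AA"): {"AA"},
    ("AA", "AB"): {"AA", "AB"},
    ("AA", "BB"): {"AB"},
    ("AB", "AB"): {"AA", "AB", "BB"},
    ("AB", "BB"): {"AB", "BB"},
    ("BB", "BB"): {"BB"},
}

def call_trio(mom_x, dad_x, child_x, centers):
    """Call a trio jointly: call the parents first, then restrict the
    child's call to Mendelian-consistent genotypes (a simplified form of
    updating the whole family together)."""
    mom = nearest_genotype(mom_x, centers)
    dad = nearest_genotype(dad_x, centers)
    allowed = MENDELIAN[tuple(sorted((mom, dad)))]
    child = min(allowed, key=lambda g: abs(child_x - centers[g]))
    return mom, dad, child
```

In the trio update, a child intensity that sits near the wrong cluster is forced to the nearest pedigree-consistent genotype, which is the kind of correction that pedigree-aware clustering provides over per-sample calling.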
Availability: The R code for the family-based genotype calling methods (SNPCaller) is available to be downloaded from the following website: http://watson.hgen.pitt.edu/register.
Supplementary information: Supplementary data are available at Bioinformatics online.
Single nucleotide polymorphisms (SNPs) are the most commonly used polymorphic markers in genetics studies. Among the different platforms for SNP genotyping, Luminex is one of the least exploited, mainly due to the lack of robust (semi-automated and replicable) freely available genotype calling software. Here we describe a clustering algorithm that provides automated SNP calls for Luminex genotyping assays. We genotyped 3 SNPs in a cohort of 330 childhood leukemia patients, 200 parents of patients and 325 healthy individuals, and used the Automated Luminex Genotyping (ALG) algorithm for SNP calling. ALG genotypes were called twice to test for reproducibility and were compared to sequencing data to test for accuracy. Globally, this analysis demonstrates the accuracy (99.6%) of the method, its reproducibility (99.8%) and its low rate of missed genotype calls (3.4%). The high efficiency of the method shows that ALG is a suitable alternative to the current commercial software. ALG is semi-automated and provides numerical measures of confidence for each SNP called, as well as an effective graphical plot. Moreover, ALG can be used either through a graphical user interface, requiring no specific informatics knowledge, or through the command line with access to the open source code. The ALG software has been implemented in R and is freely available for non-commercial use either at http://alg.sourceforge.net or by request to firstname.lastname@example.org
Despite great progress in identifying genetic variants that influence human disease, most inherited risk remains unexplained. A more complete understanding requires genome-wide studies that fully examine less common alleles in populations with a wide range of ancestry. To inform the design and interpretation of such studies, we genotyped 1.6 million common single nucleotide polymorphisms (SNPs) in 1,184 reference individuals from 11 global populations, and sequenced ten 100-kilobase regions in 692 of these individuals. This integrated data set of common and rare alleles, called ‘HapMap 3’, includes both SNPs and copy number polymorphisms (CNPs). We characterized population-specific differences among low-frequency variants, measured the improvement in imputation accuracy afforded by the larger reference panel, especially in imputing SNPs with a minor allele frequency of ≤5%, and demonstrated the feasibility of imputing newly discovered CNPs and SNPs. This expanded public resource of genome variants in global populations supports deeper interrogation of genomic variation and its role in human disease, and serves as a step towards a high-resolution map of the landscape of human genetic variation.
Accurate calling of SNPs and genotypes from next-generation sequencing data is an essential prerequisite for most human genetics studies. A number of computational steps are required or recommended when translating the raw sequencing data into the final calls. However, whether each step contributes to the performance of variant calling, and how it affects the accuracy, remains unclear, making it difficult to select and arrange appropriate steps to derive high-quality variants from different sequencing data. In this study, we made a systematic assessment of the relative contribution of each step to the accuracy of variant calling from Illumina DNA sequencing data.
We found that the read preprocessing step did not improve the accuracy of variant calling, contrary to the general expectation. Although trimming off low-quality tails helped align more reads, it introduced many false positives. The ability of duplicate marking, local realignment and recalibration to help eliminate false positive variants depended on the sequencing depth. Rearranging these steps did not affect the results. The relative performance of three popular multi-sample SNP callers, SAMtools, GATK, and GlfMultiples, also varied with the sequencing depth.
Our findings clarify the necessity and effectiveness of computational steps for improving the accuracy of SNP and genotype calls from Illumina sequencing data and can serve as a general guideline for choosing SNP calling strategies for data with different coverage.
Next-generation sequencing (NGS) approaches are widely used in genome-wide genetic marker discovery and genotyping. However, current NGS approaches are not easy to apply to general outbred populations (humans and some major farm animals) for SNP identification because of the high level of heterogeneity and phase ambiguity in the haplotypes. Here, we report a new method for SNP genotyping, called genotyping by genome reducing and sequencing (GGRS), to genotype outbred species. Through an improved procedure for library preparation and a marker discovery and genotyping pipeline, the GGRS approach can genotype outbred species cost-effectively and with high reproducibility. We also evaluated the efficiency and accuracy of our approach for high-density SNP discovery and genotyping in a large-genome pig species (2.8 Gb), for which more than 70,000 single nucleotide polymorphisms (SNPs) can be identified for an expenditure of only $80 (USD) per sample.
The sequencing of pooled non-barcoded individuals is an inexpensive and efficient means of assessing genome-wide population allele frequencies, yet its accuracy has not been thoroughly tested. We assessed the accuracy of this approach on whole, complex eukaryotic genomes by resequencing pools of largely isogenic, individually sequenced Drosophila melanogaster strains. We called SNPs in the pooled data and estimated false positive and false negative rates using the SNPs called in the individual strains as a reference. We also estimated the allele frequencies of the SNPs using the "pooled" data and compared them with the "true" frequencies taken from the estimates in the individual strains. We demonstrate that pooled sequencing provides a faithful estimate of population allele frequency, with the error well approximated by binomial sampling, and is a reliable means of novel SNP discovery with low false positive rates. However, a sufficient number of strains should be used in the pooling, because variation in the amount of DNA derived from individual strains is a substantial source of noise when the number of pooled strains is low. Our results and analysis confirm that pooled sequencing is a very powerful and cost-effective technique for assessing patterns of sequence variation in populations on genome-wide scales, and is applicable to any dataset where sequencing individuals or individual cells is impossible, difficult, time-consuming, or expensive.
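The binomial approximation described above can be sketched as follows. The extra variance term for small pools is a rough assumption used for illustration, not the paper's exact noise model.

```python
import math

def pooled_allele_freq(alt_reads, total_reads, n_strains=None):
    """Estimate a population allele frequency from pooled read counts.

    The sampling error is well approximated by binomial sampling of reads;
    when few strains are pooled, unequal DNA contributions add extra noise,
    modeled here crudely as a second binomial-form term in the number of
    strains (an illustrative assumption).
    """
    p = alt_reads / total_reads
    var = p * (1 - p) / total_reads          # binomial read-sampling variance
    if n_strains is not None:
        var += p * (1 - p) / n_strains       # extra noise from a small pool
    return p, math.sqrt(var)
```

The second term dominates when the pool is small, which mirrors the paper's observation that pooling too few strains is a substantial source of noise.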
With improvements in genotyping technologies, genome-wide association studies with hundreds of thousands of SNPs allow the identification of candidate genetic loci for multifactorial diseases in different populations. However, genotyping errors caused by genotyping platforms or genotype calling algorithms may lead to inflation of false associations between markers and phenotypes. In addition, the number of SNPs available for genome-wide association studies in the Japanese population has been investigated using only 45 samples in the HapMap project, which could lead to an inaccurate estimation of the number of SNPs with low minor allele frequencies. We genotyped 400 Japanese samples in order to estimate the number of SNPs available for genome-wide association studies in the Japanese population and to examine the performance of the current SNP Array 6.0 platform and the genotype calling algorithm "Birdseed".
About 20% of the 909,622 SNP markers on the array were found to be monomorphic in the Japanese population. Consequently, after excluding the poorly behaving SNPs, 661,599 SNPs were available for genome-wide association studies in the Japanese population. The Birdseed algorithm accurately determined the genotype calls of each sample, with a high overall call rate of over 99.5% and a high concordance rate of over 99.8% when more than 48 samples were used, after removing low-quality samples by adjusting QC criteria.
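The call rate and concordance metrics quoted above are straightforward to compute from two sets of genotype calls; a minimal sketch (the `NC` no-call code is an assumption for illustration):

```python
NO_CALL = "NC"  # placeholder code for a missing genotype call

def call_rate(calls):
    """Fraction of SNPs assigned a genotype (non-missing calls)."""
    return sum(c != NO_CALL for c in calls) / len(calls)

def concordance_rate(calls_a, calls_b):
    """Agreement between two call sets, over SNPs called in both."""
    pairs = [(a, b) for a, b in zip(calls_a, calls_b)
             if a != NO_CALL and b != NO_CALL]
    return sum(a == b for a, b in pairs) / len(pairs)
```

Concordance is conventionally computed only over markers called in both sets, which is why a caller can trade call rate against concordance by refusing to call ambiguous samples.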
Our results confirmed that the SNP Array 6.0 platform reached the level reported by the manufacturer, and thus genome-wide association studies using the SNP Array 6.0 platform have considerable potential to identify candidate susceptibility or resistance genetic factors for multifactorial diseases in the Japanese population, as well as in other populations.
High throughput next-generation sequencing technology has enabled the collection of genome-wide sequence data and revolutionized single nucleotide polymorphism (SNP) discovery in a broad range of species. When analyzed within a population genomics framework, SNP-based genotypic data may be used to investigate questions of evolutionary, ecological, and conservation significance in natural populations of non-model organisms. Kokanee salmon are recently diverged freshwater populations of sockeye salmon (Oncorhynchus nerka) that exhibit reproductive ecotypes (stream-spawning and shore-spawning) in lakes throughout western North America and northeast Asia. Current conservation and management strategies may treat these ecotypes as discrete stocks, however their recent divergence and low levels of gene flow make in-season genetic stock identification a challenge. The development of genome-wide SNP markers is an essential step towards fine-scale stock identification, and may enable a direct investigation of the genetic basis of ecotype divergence.
We used pooled cDNA samples from both ecotypes of kokanee to generate 750 million base pairs of transcriptome sequence data. These raw data were assembled into 11,074 high coverage contigs from which we identified 32,699 novel single nucleotide polymorphisms. A subset of these putative SNPs was validated using high-resolution melt analysis and Sanger resequencing to genotype independent samples of kokanee and anadromous sockeye salmon. We also identified a number of contigs that were composed entirely of reads from a single ecotype, which may indicate regions of differential gene expression between the two reproductive ecotypes. In addition, we found some evidence for greater pathogen load among the kokanee sampled in stream-spawning habitats, suggesting a possible evolutionary advantage to shore-spawning that warrants further study.
This study provides novel genomic resources to support population genetic and genomic studies of both kokanee and anadromous sockeye salmon, and has the potential to produce markers capable of fine-scale stock assessment. While this RNAseq approach was successful at identifying a large number of new SNP loci, we found that the frequency of alleles present in the pooled transcriptome data was not an accurate predictor of population allele frequencies.
Adaptation; High resolution melt analysis; Next-generation sequencing; Oncorhynchus nerka; Single nucleotide polymorphisms
The human genome contains millions of common single nucleotide polymorphisms (SNPs), and these SNPs play an important role in understanding the association between genetic variations and human diseases. Many SNPs show correlated genotypes, or linkage disequilibrium (LD), so it is not necessary to genotype all SNPs for an association study. Many algorithms have been developed to find a small subset of SNPs, called tag SNPs, that is sufficient to infer all the other SNPs. Algorithms based on the r2 LD statistic have gained popularity because r2 is directly related to the statistical power to detect disease associations. Most existing r2-based algorithms use pairwise LD. Recent studies show that multi-marker LD can help further reduce the number of tag SNPs. However, existing tag SNP selection algorithms based on multi-marker LD are both time-consuming and memory-consuming, and they cannot work on chromosomes containing more than 100 k SNPs using length-3 tagging rules.
We propose an efficient algorithm called FastTagger to calculate multi-marker tagging rules and select tag SNPs based on multi-marker LD. FastTagger uses several techniques to reduce running time and memory consumption. Our experimental results show that FastTagger is several times faster than existing multi-marker-based tag SNP selection algorithms while consuming much less memory. As a result, FastTagger can work on chromosomes containing more than 100 k SNPs using length-3 tagging rules.
FastTagger also produces smaller sets of tag SNPs than existing multi-marker based algorithms, and the reduction ratio ranges from 3%-9% when length-3 tagging rules are used. The generated tagging rules can also be used for genotype imputation. We studied the prediction accuracy of individual rules, and the average accuracy is above 96% when r2 ≥ 0.9.
Generating multi-marker tagging rules is a computation intensive task, and it is the bottleneck of existing multi-marker based tag SNP selection methods. FastTagger is a practical and scalable algorithm to solve this problem.
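The pairwise r² statistic and greedy covering that underlie r²-based tag SNP selection can be sketched as follows. This simplification uses only pairwise LD and omits the multi-marker (e.g. length-3) tagging rules that FastTagger adds; function names are illustrative.

```python
def r2(hap_a, hap_b):
    """Pairwise LD statistic r^2 between two SNPs from phased haplotypes
    coded 0/1: r^2 = D^2 / (pA(1-pA) pB(1-pB)), with D = pAB - pA*pB."""
    n = len(hap_a)
    pa = sum(hap_a) / n
    pb = sum(hap_b) / n
    pab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n
    d = pab - pa * pb
    return d * d / (pa * (1 - pa) * pb * (1 - pb))

def greedy_tags(haplotypes, threshold=0.8):
    """Pick a tag set so every SNP has some tag with r^2 >= threshold.
    A simple greedy pairwise sketch, not FastTagger's algorithm."""
    tags, covered = [], set()
    for i, hap in enumerate(haplotypes):
        if i in covered:
            continue
        tags.append(i)  # uncovered SNP becomes a new tag
        for j in range(len(haplotypes)):
            if j not in covered and r2(hap, haplotypes[j]) >= threshold:
                covered.add(j)
    return tags
```

Multi-marker rules generalize this by allowing a small set of tags to jointly predict a target SNP, which is what shrinks the tag set further at the cost of a much larger rule-generation step.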
In animals, the population genomic literature is dominated by two taxa, namely mammals and drosophilids, in which fully sequenced, well-annotated genomes have been available for years. Data from other metazoan phyla are scarce, probably because the vast majority of living species still lack a closely related reference genome. Here we achieve de novo, reference-free population genomic analysis from wild samples in five non-model animal species, based on next-generation sequencing transcriptome data. We introduce a pipeline for cDNA assembly, read mapping, SNP/genotype calling, and data cleaning, with a specific focus on the issue of hidden paralogy detection. In two species for which a reference genome is available, similar results were obtained whether the reference was used or not, demonstrating the robustness of our de novo inferences. The population genomic profiles of a hare, a turtle, an oyster, a tunicate, and a termite were found to be intermediate between those of human and Drosophila, indicating that the discordant genomic diversity patterns that have been reported between these two species do not reflect a generalized vertebrate versus invertebrate gap. The genomic average diversity was generally higher in invertebrates than in vertebrates (with the notable exception of the termite), in agreement with the notion that population size tends to be larger in the former than in the latter. The non-synonymous to synonymous ratio, however, did not differ significantly between vertebrates and invertebrates, even though it was negatively correlated with genetic diversity within each of the two groups. This study opens promising perspectives regarding genome-wide population analyses of non-model organisms and the influence of population size on non-synonymous versus synonymous diversity.
The analysis of genomic variation between individuals of a given species has so far been restricted to a small number of model organisms, such as human and fruitfly, for which a fully sequenced, well-annotated reference genome was available. Here we show that, thanks to next-generation high-throughput sequencing technologies and appropriate genotype-calling methods, de novo population genomic analysis is possible in the absence of a reference genome. We characterize the genomic level of neutral and selected polymorphism in five non-model animal species, two vertebrates and three invertebrates, paying particular attention to the treatment of multi-copy genes. The analyses demonstrate the influence of population size on genetic diversity in animals, the two vertebrates (hare, turtle) and the social insect (termite) being less polymorphic than the two marine invertebrates (oyster, tunicate) in our sample. Interestingly, genomic indicators of the efficiency of natural selection, both purifying and adaptive, did not vary in a simple, predictable way across organisms. These results prove the value of a diversified sampling of species when it comes to understanding the determinants of genome evolutionary dynamics.
In addition to single-nucleotide polymorphisms (SNPs), copy number variation (CNV) is a major component of human genetic diversity. Among the many whole-genome analysis platforms, SNP arrays have been commonly used for genome-wide CNV discovery. Recently, a number of algorithms for defining CNVs from SNP genotyping data have been developed; however, because SNP genotyping data are fundamentally limited as a measure of signal intensity, concerns remain about false discovery and low sensitivity in CNV detection. In this study, we aimed to assess the effect of combining multiple CNV calling algorithms and to establish the most reliable pipeline for CNV calling with Affymetrix Genome-Wide SNP 5.0 data. For this purpose, we selected the three algorithms most commonly used for CNV segmentation from SNP genotyping data: PennCNV, QuantiSNP, and BirdSuite. After defining the CNV loci with the three algorithms, we assessed how many of the calls overlapped with each other, and we validated the CNVs by genomic quantitative PCR. Based on this analysis, we propose that, for a reliable CNV-based genome-wide association study using SNP array data, CNV calling should be performed with at least three different algorithms, and only the CNVs consistently called by two or more algorithms should be used for association analysis, because these are more reliable than CNVs called by a single algorithm. Our results should help in setting up CNV analysis protocols for Affymetrix Genome-Wide SNP 5.0 genotyping data.
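The "consistently called by two or more algorithms" rule described above amounts to an interval-overlap consensus across call sets. A toy sketch of that idea, assuming CNVs on one chromosome are represented as (start, end) tuples (this is an illustrative reimplementation, not the authors' pipeline):

```python
def overlaps(a, b):
    # Two intervals (start, end) on the same chromosome overlap if
    # each starts before the other ends.
    return a[0] <= b[1] and b[0] <= a[1]

def consensus_cnvs(call_sets, min_support=2):
    """Keep CNV calls supported by at least `min_support` algorithms.

    call_sets: one list of (start, end) intervals per algorithm.
    A call's support is 1 (its own algorithm) plus the number of
    other algorithms with at least one overlapping call.
    """
    consensus = []
    for i, calls in enumerate(call_sets):
        for cnv in calls:
            support = 1 + sum(
                any(overlaps(cnv, other) for other in call_sets[j])
                for j in range(len(call_sets)) if j != i
            )
            if support >= min_support and cnv not in consensus:
                consensus.append(cnv)
    return consensus
```

With the paper's three algorithms and `min_support=2`, only loci reproduced by a second caller survive, which is exactly the filter the authors recommend before association analysis.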
CNV defining algorithm; DNA copy number variations; SNP array
Current genome-wide association studies (GWAS) use commercial genotyping microarrays that can assay over a million single nucleotide polymorphisms (SNPs). The number of SNPs is further boosted by advanced statistical genotype-imputation algorithms and large SNP databases for reference human populations. The testing of a huge number of SNPs needs to be taken into account in the interpretation of statistical significance in such genome-wide studies, but this is complicated by the non-independence of SNPs because of linkage disequilibrium (LD). Several previous groups have proposed the use of the effective number of independent markers (Me) for the adjustment of multiple testing, but current methods of calculation for Me are limited in accuracy or computational speed. Here, we report a more robust and fast method to calculate Me. Applying this efficient method [implemented in a free software tool named Genetic type 1 error calculator (GEC)], we systematically examined the Me, and the corresponding p-value thresholds required to control the genome-wide type 1 error rate at 0.05, for 13 Illumina or Affymetrix genotyping arrays, as well as for HapMap Project and 1000 Genomes Project datasets which are widely used in genotype imputation as reference panels. Our results suggested the use of a p-value threshold of ~10⁻⁷ as the criterion for genome-wide significance for early commercial genotyping arrays, but slightly more stringent p-value thresholds of ~5 × 10⁻⁸ for current or merged commercial genotyping arrays, ~10⁻⁸ for all common SNPs in the 1000 Genomes Project dataset, and ~5 × 10⁻⁸ for the common SNPs only within genes.
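The correction behind the Me approach is Bonferroni-style, but with the effective number of independent tests in the denominator instead of the raw SNP count. A minimal sketch of the resulting threshold (illustrative only, not the GEC tool itself):

```python
def genomewide_threshold(me, alpha=0.05):
    """Per-SNP p-value threshold that controls the genome-wide
    type 1 error rate at `alpha`, given Me effective independent tests.
    """
    return alpha / me

# With roughly one million effective independent tests, this recovers
# the conventional genome-wide significance threshold of about 5e-8;
# smaller Me (early, sparser arrays) gives the looser ~1e-7 criterion.
```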
Electronic supplementary material
The online version of this article (doi:10.1007/s00439-011-1118-2) contains supplementary material, which is available to authorized users.
Imputation-based association methods provide a powerful framework for testing untyped variants for association with phenotypes and for combining results from multiple studies that use different genotyping platforms. Here, we consider several issues that arise when applying these methods in practice, including: (i) factors affecting imputation accuracy, including choice of reference panel; (ii) the effects of imputation accuracy on power to detect associations; (iii) the relative merits of Bayesian and frequentist approaches to testing imputed genotypes for association with phenotype; and (iv) how to quickly and accurately compute Bayes factors for testing imputed SNPs. We find that imputation-based methods can be robust to imputation accuracy and can improve power to detect associations, even when average imputation accuracy is poor. We explain how ranking SNPs for association by a standard likelihood ratio test gives the same results as a Bayesian procedure that uses an unnatural prior assumption—specifically, that difficult-to-impute SNPs tend to have larger effects—and assess the power gained from using a Bayesian approach that does not make this assumption. Within the Bayesian framework, we find that good approximations to a full analysis can be achieved by simply replacing unknown genotypes with a point estimate—their posterior mean. This approximation considerably reduces computational expense compared with published sampling-based approaches, and the methods we present are practical on a genome-wide scale with very modest computational resources (e.g., a single desktop computer). The approximation also facilitates combining information across studies, using only summary data for each SNP. Methods discussed here are implemented in the software package BIMBAM, which is available from http://stephenslab.uchicago.edu/software.html.
Genotype imputation is becoming a popular approach to comparing and combining results of multiple association studies that used different SNP genotyping platforms. The basic idea is to exploit the fact that, due to correlation among untyped and typed SNPs, genotypes of untyped SNPs in each study can be inferred (“imputed”) from the genotypes at typed SNPs, often with high accuracy. In this paper, we consider several issues that arise when applying these methods in practice, including factors affecting imputation accuracy, the importance of taking account of imputation uncertainty when testing for association between imputed SNPs and phenotype, how imputation accuracy affects power, and how to combine results across studies when only single-SNP summary data can be shared among research groups.
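The posterior-mean approximation described in the two abstracts above replaces each unknown genotype with its expected allele count (the "dosage") and then tests that dosage for association. A toy sketch of both steps (illustrative only, not the BIMBAM implementation):

```python
def posterior_mean_dosage(probs):
    # probs: (P(0 copies), P(1 copy), P(2 copies)) of the non-reference
    # allele for one individual at one imputed SNP.
    p0, p1, p2 = probs
    return 0.0 * p0 + 1.0 * p1 + 2.0 * p2

def dosage_effect(dosages, phenotypes):
    # Least-squares slope of phenotype on dosage: a point estimate of
    # the per-allele effect size at this SNP.
    n = len(dosages)
    mx = sum(dosages) / n
    my = sum(phenotypes) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(dosages, phenotypes))
    var = sum((x - mx) ** 2 for x in dosages)
    return cov / var
```

Because the dosage is a single number per individual, this avoids the sampling over genotype configurations that makes fully Bayesian treatments expensive, which is the computational saving the paper emphasizes.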
Currently, genome-wide evaluation of cattle populations is based on SNP genotyping with ~54,000 SNPs. Increasing the number of markers might improve genomic predictions and the power of genome-wide association studies. Genotype imputation makes it possible to extrapolate genotypes from lower- to higher-density arrays, based on a representative reference sample for which genotypes are available at the higher density.
Genotypes at 639,214 SNPs were available for 797 bulls of the Fleckvieh cattle breed. The data set was divided into a reference and a validation population. Genotypes for all SNPs except those included in the BovineSNP50 BeadChip were masked and subsequently imputed for animals of the validation population. Imputation of genotypes was performed with Beagle, findhap.f90, MaCH and Minimac. The accuracy of the imputed genotypes was assessed for four scenarios with 50, 100, 200 and 400 animals in the reference population. The reference animals were selected to account for 78.03%, 89.21%, 97.47% and >99% of the gene pool of the genotyped population, respectively.
Imputation accuracy increased as the number of animals and relatives in the reference population increased. Population-based algorithms provided highly reliable imputation of genotypes, even for the scenarios with only 50 or 100 reference animals. Using MaCH and Minimac, the correlation between true and imputed genotypes exceeded 0.975 with only 100 reference animals. Pre-phasing the genotypes of both the reference and validation populations not only provided highly accurate imputed genotypes but was also computationally efficient. Genome-wide analysis of imputation accuracy also led to the identification of many misplaced SNPs.
Genotyping key animals at high density and subsequent population-based genotype imputation yield high imputation accuracy. Pre-phasing the genotypes of the reference and validation populations is computationally efficient and results in high imputation accuracy, even when the reference population is small.
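The imputation accuracy reported in the results above is the correlation between true and imputed genotypes. A self-contained sketch of that metric for one SNP across individuals, treating genotypes as allele dosages (names are illustrative):

```python
from math import sqrt

def imputation_accuracy(true_g, imputed_g):
    """Pearson correlation between true and imputed genotype dosages."""
    n = len(true_g)
    mt = sum(true_g) / n
    mi = sum(imputed_g) / n
    cov = sum((t - mt) * (i - mi) for t, i in zip(true_g, imputed_g))
    st = sqrt(sum((t - mt) ** 2 for t in true_g))
    si = sqrt(sum((i - mi) ** 2 for i in imputed_g))
    return cov / (st * si)
```

Computed per SNP along the genome, a run of SNPs with unexpectedly low correlation is exactly the kind of signal the authors used to flag misplaced SNPs.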
In genome-wide association studies, high-level statistical analyses rely on the validity of the called genotypes, and different genotype calling algorithms (GCAs) have been proposed. We compared the GCAs Bayesian robust linear modeling using Mahalanobis distance (BRLMM), Chiamo++, and JAPL using the autosomal single-nucleotide polymorphisms (SNPs) from the Affymetrix 500K Array Set data of the Framingham Heart Study, as provided for Genetic Analysis Workshop 16, Problem 2, and applied standard quality control (sQC) for each algorithm. JAPL retained the most individuals for the analysis. The lowest number of SNPs that passed sQC was observed for BRLMM and the highest for Chiamo++. All three GCAs fulfilled all sQC criteria for 79% of the SNPs, but at least one GCA failed for 18% of the SNPs. Previously undetected errors in strand coding were identified by comparing genotype concordances between GCAs. Concordance dropped with the number of GCAs failing sQC. We conclude that JAPL and Chiamo++ are the GCAs of choice if the aim is to keep as many subjects and as many SNPs as possible, respectively.
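The genotype concordance used above to compare callers is simply the fraction of SNPs at which two algorithms return the same genotype, restricted to SNPs both algorithms actually called. A minimal sketch (illustrative, not the workshop's analysis code):

```python
def concordance(calls_a, calls_b, missing=None):
    """Fraction of positions where two callers agree, among positions
    called (non-missing) by both."""
    both = [(a, b) for a, b in zip(calls_a, calls_b)
            if a is not missing and b is not missing]
    if not both:
        return float("nan")
    return sum(a == b for a, b in both) / len(both)
```

Systematically low concordance at a SNP, as the authors note, can expose problems such as strand-coding errors rather than genuine calling disagreement.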
Mapping short next-generation reads to reference genomes is an important element of SNP calling and expression studies. Major limitations to large-scale whole-genome mapping are the algorithm's large memory requirements and the long run-times necessary for accurate studies. Several parallel implementations have been developed to distribute memory across processors and to share the processing load equally. These approaches are compared with respect to their memory footprint, load balancing, and accuracy. When using MPI with multi-threading, linear speedup can be achieved for up to 256 processors.
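The linear-speedup claim above can be stated with the standard definitions: speedup is serial time over parallel time, and parallel efficiency is speedup over processor count, with linear speedup corresponding to efficiency near 1. A generic sketch (not code from the study):

```python
def speedup(t_serial, t_parallel):
    # How many times faster the parallel run is than the serial run.
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, n_procs):
    # Speedup per processor; linear speedup gives efficiency ~ 1.0.
    return speedup(t_serial, t_parallel) / n_procs
```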