Whole genome analysis in large samples from a single population is needed to provide adequate power to assess the relative strengths of natural selection across different functional components of the genome. In this study, we analyzed next-generation sequencing data from 962 European Americans and found that, as expected, approximately 60% of the top 1% of positive selection signals lie in intergenic regions, 33% in intronic regions, and slightly over 1% in coding regions. Several detailed functional annotation categories in intergenic regions showed statistically significant enrichment in positively selected loci when compared to the null distribution of the genomic span of ENCODE categories. There was a significant enrichment of purifying selection signals in enhancers, transcription factor binding sites, microRNAs and their target sites, but not in lincRNAs or piRNAs, suggesting different evolutionary constraints for these domains. Loci in “repressed or low activity regions” and loci near or overlapping the transcription start site were the most significantly over-represented annotations among the top 1% of signals for positive selection.
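The comparison against a genomic-span null can be illustrated with a small calculation. The sketch below is not the study's procedure: it assumes a simple binomial null in which each top signal falls in an annotation category with probability equal to the fraction of the genome that category spans. The function names and the counts are hypothetical.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def enrichment(hits_in_category: int, total_hits: int, span_fraction: float):
    """Fold enrichment and one-sided p-value under the genomic-span null.

    span_fraction is the share of the genome covered by the category
    (an assumed input; the abstract does not give these values).
    """
    expected = total_hits * span_fraction
    fold = hits_in_category / expected
    p_value = binomial_upper_tail(hits_in_category, total_hits, span_fraction)
    return fold, p_value


# Hypothetical numbers: 30 of 200 top signals fall in a category
# that covers 5% of the genome.
fold, p_value = enrichment(30, 200, 0.05)
print(f"fold enrichment = {fold:.1f}, one-sided p = {p_value:.2e}")
```

A real analysis would also correct for testing many categories at once, which this sketch omits.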
Generation of long (>5 Kb) DNA sequencing reads provides an approach for interrogation of complex regions in the human genome. Currently, large-insert whole genome sequencing (WGS) technologies from Pacific Biosciences (PacBio) enable analysis of chromosomal structural variations (SVs), but the cost to achieve the required sequence coverage across the entire human genome is high.
We developed a method (termed PacBio-LITS) that combines oligonucleotide-based DNA target-capture enrichment technologies with PacBio large-insert library preparation to facilitate SV studies at specific chromosomal regions. PacBio-LITS provides deep sequence coverage at the specified sites at substantially reduced cost compared with PacBio WGS. The efficacy of PacBio-LITS is illustrated by delineating the breakpoint junctions of low copy repeat (LCR)-associated complex structural rearrangements on chr17p11.2 in patients diagnosed with Potocki–Lupski syndrome (PTLS; MIM#610883). We successfully identified previously determined breakpoint junctions in three PTLS cases, and also were able to discover novel junctions in repetitive sequences, including LCR-mediated breakpoints. The new information has enabled us to propose mechanisms for formation of these structural variants.
The new method leverages the cost efficiency of targeted capture-sequencing as well as the mappability and scaffolding capabilities of long sequencing reads generated by the PacBio platform. It is therefore suitable for studying complex SVs, especially those involving LCRs, inversions, and the generation of chimeric Alu elements at the breakpoints. Other genomic research applications, such as haplotype phasing and validation of small insertions and deletions, could also benefit from this technology.
Targeted sequencing; Single molecule sequencing; Complex genomic rearrangement
Identifying insertion/deletion polymorphisms (INDELs) with high confidence has been intrinsically challenging in short-read sequencing data. Here we report our approach for improving INDEL calling accuracy by using a machine learning algorithm to combine call sets generated with three independent methods, and by leveraging the strengths of each individual pipeline. Utilizing this approach, we generated a consensus exome INDEL call set from a large dataset generated by the 1000 Genomes Project (1000G), maximizing both the sensitivity and the specificity of the calls.
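The abstract does not specify which machine learning algorithm or input features were used. As a hedged illustration only, the sketch below shows one plausible first step for combining three call sets: building a per-INDEL support vector that any classifier could take as input. The caller names and the calls are hypothetical.

```python
from collections import defaultdict


def support_table(call_sets):
    """Map each INDEL key to a 0/1 support vector, one entry per caller.

    call_sets: dict of caller name -> set of (chrom, pos, ref, alt) tuples.
    Returns the caller order used and the INDEL -> vector table.
    """
    names = sorted(call_sets)
    table = defaultdict(lambda: [0] * len(names))
    for idx, name in enumerate(names):
        for indel in call_sets[name]:
            table[indel][idx] = 1
    return names, dict(table)


# Hypothetical calls from three hypothetical pipelines.
calls = {
    "caller_a": {("1", 1000, "AT", "A"), ("2", 500, "G", "GCC")},
    "caller_b": {("1", 1000, "AT", "A")},
    "caller_c": {("1", 1000, "AT", "A"), ("3", 42, "C", "CT")},
}
names, table = support_table(calls)
for indel, support in sorted(table.items()):
    print(indel, support)
```

In practice such vectors would be extended with per-call quality features and paired with validated labels before training a model.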
This consensus exome INDEL call set features 7,210 INDELs, from 1,128 individuals across 13 populations included in the 1000 Genomes Phase 1 dataset, with a false discovery rate (FDR) of about 7.0%.
In our study we further characterize the patterns and distributions of these exonic INDELs with respect to density, allele length, and site frequency spectrum, as well as the potential mutagenic mechanisms of coding INDELs in humans.
INDEL; 1000 Genomes Project; Distribution; Mutagenesis
We describe initial steps for interrogating whole genome sequence (WGS) data to characterize the genetic architecture of a complex trait, such as high density lipoprotein cholesterol (HDL-C). We estimate that common variation contributes more to HDL-C heritability than rare variation, and screening for Mendelian dyslipidemia variants identified individuals with extreme HDL-C. WGS analyses highlight the value of regulatory and non-protein coding regions of the genome in addition to protein coding regions.
Interpreting variants, especially noncoding ones, in the increasing number of personal genomes is challenging. We used patterns of polymorphisms in functionally annotated regions in 1092 humans to identify deleterious variants; then we experimentally validated candidates. We analyzed both coding and noncoding regions, with the former corroborating the latter. We found regions particularly sensitive to mutations (“ultrasensitive”) and variants that are disruptive because of mechanistic effects on transcription-factor binding (that is, “motif-breakers”). We also found variants in regions with higher network centrality tend to be deleterious. Insertions and deletions followed a similar pattern to single-nucleotide variants, with some notable exceptions (e.g., certain deletions and enhancers). On the basis of these patterns, we developed a computational tool (FunSeq), whose application to ~90 cancer genomes reveals nearly a hundred candidate noncoding drivers.
Massively parallel DNA sequencing generates staggering amounts of data. Decreasing cost, increasing throughput, and improved annotation have expanded the diversity of genomics applications in research and clinical practice. This expanding scale creates analytical challenges: accommodating peak compute demand, coordinating secure access for multiple analysts, and sharing validated tools and results.
To address these challenges, we have developed the Mercury analysis pipeline and deployed it on local hardware and in the Amazon Web Services cloud via the DNAnexus platform. Mercury is an automated, flexible, and extensible analysis workflow that provides accurate and reproducible genomic results at scales ranging from individuals to large cohorts.
By taking advantage of cloud computing and with Mercury implemented on the DNAnexus platform, we have demonstrated a powerful combination of a robust and fully validated software pipeline and a scalable computational resource that, to date, we have applied to more than 10,000 whole genome and whole exome samples.
NGS data; Variant calling; Annotation; Clinical sequencing; Cloud computing
The synthesis, secretion and clearance of von Willebrand factor (VWF) are regulated by genetic variations in coding and promoter regions of the VWF gene. We have previously identified 19 single nucleotide polymorphisms (SNPs), primarily in introns, that are associated with VWF antigen levels in subjects of European descent. In this study, we conducted race-by-gender analyses to compare the association of VWF SNPs with VWF antigen among 10,434 healthy Americans of European (EA) or African (AA) descent from the Atherosclerosis Risk in Communities (ARIC) study. Among 75 SNPs analyzed, 13 and 10 SNPs were associated with VWF antigen levels in EA male and EA female subjects, respectively. However, only one SNP (rs1063857) was significantly associated with VWF antigen in AA females, and none was associated in AA males. Haplotype analysis of the ARIC samples and a study of racial diversity in the VWF gene using the 1000 Genomes database suggest a greater degree of variation in the VWF gene in AA subjects than in EA subjects. Together, these data suggest potential race and gender divergence in the regulation of VWF expression by genetic variations.
In the present work, we investigate the pre-existing template-assisted glancing angle deposition of Al columnar structures on a Cu substrate by means of molecular dynamics simulations, with a focus on the effect of deposition-induced template deformation on the morphologies of the fabricated structures. Our simulations demonstrate that the pre-existing templates significantly intensify the shadowing effect, which facilitates the formation of columnar structures under small deposition flux. The underlying deformation modes of the templates under different deposition configurations are analyzed and correlated with the geometrical characteristics of the columnar structures. We find that the deformation behavior of the templates, which depends on template height, strongly influences the morphologies of the fabricated columnar structures. Our findings provide design guidelines for the fabrication of one-dimensional nanostructures by the template-assisted deposition technique.
Glancing angle deposition; Template; Deformation mechanism; Molecular dynamics
Gene expression is one of the most relevant biological processes of living cells. Because of the relatively small population sizes, it is predicted that human gene sequences are not strongly influenced by selection towards expression efficiency. One of the major problems in estimating to what extent gene characteristics can be selected to maximize expression efficiency is the wide variation that exists in RNA and protein levels among physiological states and different tissues. Analyses of datasets of stably expressed genes (i.e. with consistent expression between physiological states and tissues) would provide more accurate and reliable measurements of associations between variations of a specific gene characteristic and expression, and of how distinct gene features work to optimize gene expression.
Using a dataset of human genes with consistent expression between physiological states we selected gene sequence signatures related to translation that can predict about 42% of mRNA variation. The prediction can be increased to 51% when selecting genes that are stably expressed in more than 1 tissue. These genes are enriched for translation and ribosome biosynthesis processes and have higher translation efficiency scores, smaller coding sequences and 3′ UTR sizes and lower folding energies when compared to other datasets. Additionally, the amino acid frequencies weighted by expression showed higher correlations with isoacceptor tRNA gene copy number, and smaller absolute correlation values with biosynthetic costs.
Our results indicate that human gene sequence characteristics related to transcription and translation processes can co-evolve in an integrated manner in order to optimize gene expression.
Until recently, sequencing has primarily been carried out in large genome centers, which have invested heavily in developing the computational infrastructure that enables genomic sequence analysis. Recent advances in next generation sequencing (NGS) have led to a wide dissemination of sequencing technologies and data to highly diverse research groups. It is expected that clinical sequencing will become part of diagnostic routines shortly. However, limited access to computational infrastructure and high quality bioinformatic tools, and the demand for personnel skilled in data analysis and interpretation, remain serious bottlenecks. Cloud computing and Software-as-a-Service (SaaS) technologies can help address these issues.
We successfully enabled the Atlas2 Cloud pipeline for personal genome analysis on two different cloud service platforms: a community cloud via the Genboree Workbench, and a commercial cloud via Amazon Web Services using a Software-as-a-Service model. We report a case study of personal genome analysis using our Atlas2 Genboree pipeline. We also outline a detailed cost structure for running Atlas2 Amazon on whole exome capture data, providing cost projections in terms of storage, compute and I/O when running Atlas2 Amazon on a large data set.
We find that providing a web interface and an optimized pipeline clearly facilitates usage of cloud computing for personal genome analysis, but for it to be routinely used for large scale projects there needs to be a paradigm shift in the way we develop tools, in standard operating procedures, and in funding mechanisms.
Indels are an important cause of human variation and central to the study of human disease. The 1000 Genomes Project Low-Coverage Pilot identified over 1.3 million indels shorter than 50 bp, of which over 890 were identified as potentially disruptive variants. Yet, despite their ubiquity, the local genomic characteristics of indels remain unexplored.
Herein we describe population- and minor allele frequency-based differences in linkage disequilibrium and imputation characteristics for indels included in the 1000 Genomes Project Low-Coverage Pilot for the CEU, YRI and CHB+JPT populations. Common indels were well tagged by nearby SNPs in all studied populations, and were also tagged at a similar rate to common SNPs. Both neutral and functionally deleterious common indels were imputed with greater than 95% concordance from HapMap Phase 3 and OMNI SNP sites. Further, 38 to 56% of low frequency indels were tagged by low frequency SNPs. We were able to impute heterozygous low frequency indels with over 50% concordance. Lastly, our analysis also revealed evidence of ascertainment bias. This bias prevents us from extending the applicability of our results to highly polymorphic indels that could not be identified in the Low-Coverage Pilot.
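Whether a nearby SNP "tags" an indel is usually judged by the pairwise linkage disequilibrium statistic r². The sketch below computes r² from phased haplotypes coded 0/1; the function name and the toy haplotypes are hypothetical, and the abstract does not state which r² threshold was used to call an indel tagged.

```python
def r_squared(hap_a, hap_b):
    """r^2 between two biallelic sites from phased 0/1 haplotype vectors.

    hap_a[i] and hap_b[i] are the alleles carried by haplotype i at the
    two sites (for example, an indel and a neighbouring SNP).
    """
    n = len(hap_a)
    if n == 0 or n != len(hap_b):
        raise ValueError("haplotype vectors must be non-empty and equal length")
    p_a = sum(hap_a) / n
    p_b = sum(hap_b) / n
    p_ab = sum(a & b for a, b in zip(hap_a, hap_b)) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both sites must be polymorphic")
    d = p_ab - p_a * p_b  # coefficient of linkage disequilibrium, D
    return d * d / denom


# Toy haplotypes: the first SNP is in perfect LD with the indel,
# the second is statistically independent of it.
indel = [0, 0, 1, 1]
snp_linked = [0, 0, 1, 1]
snp_unlinked = [0, 1, 0, 1]
print(r_squared(indel, snp_linked), r_squared(indel, snp_unlinked))
```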
Although further scope exists to improve the imputation of low frequency indels, our study demonstrates that there are already ample opportunities to retrospectively impute indels for prior genome-wide association studies and to incorporate indel imputation into future case/control studies.
Next-generation DNA sequencing is opening new avenues for genetic association studies in common diseases that, like deep vein thrombosis (DVT), have a strong genetic predisposition still largely unexplained by currently identified risk variants. In order to develop sequencing and analytical pipelines for the application of next-generation sequencing to complex diseases, we conducted a pilot study sequencing the coding area of 186 hemostatic/proinflammatory genes in 10 Italian cases of idiopathic DVT and 12 healthy controls.
A molecular-barcoding strategy was used to multiplex DNA target capture and sequencing, while retaining individual sequence information. Genomic libraries with barcode sequence-tags were pooled (in pools of 8 or 16 samples) and enriched for target DNA sequences. Sequencing was performed on ABI SOLiD-4 platforms. We produced more than 12 gigabases of raw sequence data, covering the 700-kilobase target area in 22 individuals at high depth (average: 42X). A total of 1876 high-quality genetic variants were identified (1778 single nucleotide substitutions and 98 insertions/deletions). Annotation against databases of genetic variation and human disease mutations revealed several novel, potentially deleterious mutations. We tested 576 common variants in a case-control association analysis, carrying the top five associations over to replication in up to 719 DVT cases and 719 controls. We also conducted an analysis of the burden of nonsynonymous variants in coagulation factor and anticoagulant genes. We found an excess of rare missense mutations in anticoagulant genes in DVT cases compared to controls and an association for a missense polymorphism of FGA (rs6050; p = 1.9 × 10⁻⁵, OR 1.45; 95% CI, 1.22-1.72; after replication in more than 1400 individuals).
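For readers unfamiliar with how an odds ratio and its confidence interval are obtained from case-control counts, the sketch below uses the standard log odds ratio (Woolf) approximation on a 2×2 table. The counts are hypothetical and are not the study's data, and the study may have used a different model (for example, one based on allele counts or adjusted regression).

```python
from math import exp, log, sqrt


def odds_ratio_ci(case_exposed, case_unexposed, ctrl_exposed, ctrl_unexposed, z=1.96):
    """Odds ratio with an approximate 95% CI from a 2x2 table.

    Uses the normal approximation on the log odds ratio (Woolf method);
    all four cells must be greater than zero.
    """
    cells = (case_exposed, case_unexposed, ctrl_exposed, ctrl_unexposed)
    if min(cells) <= 0:
        raise ValueError("all cells must be positive for this approximation")
    odds_ratio = (case_exposed * ctrl_unexposed) / (case_unexposed * ctrl_exposed)
    se = sqrt(sum(1 / c for c in cells))  # standard error of log(OR)
    lower = exp(log(odds_ratio) - z * se)
    upper = exp(log(odds_ratio) + z * se)
    return odds_ratio, lower, upper


# Hypothetical counts: carriers / non-carriers among cases and controls.
or_, lo, hi = odds_ratio_ci(20, 80, 10, 90)
print(f"OR = {or_:.2f}, 95% CI {lo:.2f}-{hi:.2f}")
```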
We implemented a barcode-based strategy to efficiently multiplex sequencing of hundreds of candidate genes in several individuals. In the relatively small dataset of our pilot study we were able to identify bona fide associations with DVT. Our study illustrates the potential of next-generation sequencing for the discovery of genetic variation predisposing to complex diseases.
Deep vein thrombosis; venous thromboembolism; next-generation sequencing; target capture; multiplexing; FGA; rs6025; heamostateome; DVT; VTE
Whole exome capture sequencing allows researchers to cost-effectively sequence the coding regions of the genome. Although the exome capture sequencing methods have become routine and well established, there is currently a lack of tools specialized for variant calling in this type of data.
Using statistical models trained on validated whole-exome capture sequencing data, the Atlas2 Suite is an integrative variant analysis pipeline optimized for variant discovery on all three of the widely used next generation sequencing platforms (SOLiD, Illumina, and Roche 454). The suite employs logistic regression models in conjunction with user-adjustable cutoffs to accurately separate true SNPs and INDELs from sequencing and mapping errors with high sensitivity (96.7%).
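The combination of a logistic regression model with a user-adjustable cutoff can be sketched as follows. This is an illustration of the general technique only: the feature names, weights, and intercept are hypothetical and are not the trained Atlas2 coefficients.

```python
from math import exp


def variant_probability(features, weights, intercept):
    """Logistic model: probability that a candidate site is a true variant."""
    z = intercept + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + exp(-z))


def is_called(features, weights, intercept, cutoff=0.5):
    """Apply a user-adjustable posterior cutoff to the model output."""
    return variant_probability(features, weights, intercept) >= cutoff


# Hypothetical model and candidate site (illustrative values only).
weights = {"variant_read_fraction": 6.0, "mean_base_quality": 0.1}
intercept = -5.0
site = {"variant_read_fraction": 0.45, "mean_base_quality": 30.0}

p = variant_probability(site, weights, intercept)
print(f"p = {p:.3f}, called at 0.5: {is_called(site, weights, intercept)}")
print(f"called at 0.9: {is_called(site, weights, intercept, cutoff=0.9)}")
```

Raising the cutoff trades sensitivity for specificity, which is the role the abstract assigns to the user-adjustable cutoffs.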
We have implemented the Atlas2 Suite and applied it to 92 whole exome samples from the 1000 Genomes Project. The Atlas2 Suite is available for download at http://sourceforge.net/projects/atlas2/. In addition to a command line version, the suite has been integrated into the Genboree Workbench, allowing biomedical scientists with minimal informatics expertise to remotely call, view, and further analyze variants through a simple web interface. The existing genomic databases displayed via the Genboree browser also streamline the process from variant discovery to functional genomics analysis, resulting in an off-the-shelf toolkit for the broader community.
Rare coding variants constitute an important class of human genetic variation, but are underrepresented in current databases that are based on small population samples. Recent studies show that variants altering amino acid sequence and protein function are enriched at low variant allele frequency, 2 to 5%, but because of insufficient sample size it is not clear if the same trend holds for rare variants below 1% allele frequency.
The 1000 Genomes Exon Pilot Project has collected deep-coverage exon-capture data in roughly 1,000 human genes, for nearly 700 samples. Although medical whole-exome projects are currently under way, this is still the deepest reported sampling of a large number of human genes with next-generation technologies. In accordance with the goals of the 1000 Genomes Project, we created effective informatics pipelines to process and analyze the data, and discovered 12,758 exonic SNPs, 70% of them novel, and 74% below 1% allele frequency in the seven population samples we examined. Our analysis confirms that coding variants below 1% allele frequency show increased population-specificity and are enriched for functional variants.
This study represents a large step toward detecting and interpreting low frequency coding variation, clearly lays out technical steps for effective analysis of DNA capture data, and articulates functional and population properties of this important class of genetic variation.
Rhesus macaques are the most widely utilized nonhuman primate model in biomedical research. Previous efforts have validated fewer than 900 single nucleotide polymorphisms (SNPs) in this species, which limits opportunities for genetic studies related to health and disease. Extensive information about SNPs and other genetic variation in rhesus macaques would facilitate valuable genetic analyses, as well as provide markers for genome-wide linkage analysis and the genetic management of captive breeding colonies.
We used the available rhesus macaque draft genome sequence, new sequence data from unrelated individuals and existing published sequence data to create a genome-wide SNP resource for Indian-origin rhesus monkeys. The original reference animal and two additional Indian-origin individuals were resequenced to low coverage using SOLiD™ sequencing. We then used three strategies to validate SNPs: comparison of potential SNPs found in the same individual using two different sequencing chemistries, and comparison of potential SNPs in different individuals identified with either the same or different sequencing chemistries. Our approach validated approximately 3 million SNPs distributed across the genome. Preliminary analysis of SNP annotations suggests that a substantial number of these macaque SNPs may have functional effects. More than 700 non-synonymous SNPs were scored by Polyphen-2 as either possibly or probably damaging to protein function and these variants now constitute potential models for studying functional genetic variation relevant to human physiology and disease.
Resequencing of a small number of animals identified more than 3 million SNPs. This provides a significant new information resource for rhesus macaques, an important research animal. The data also suggest that overall genetic variation is high in this species. We identified many potentially damaging non-synonymous coding SNPs, providing new opportunities to identify rhesus models for human disease.
single nucleotide polymorphism; common variants; SOLiD™; genetic variation; rhesus macaque
Genetic studies of populations from the Indian subcontinent are of great interest because of India's large population size, complex demographic history, and unique social structure. Despite recent large-scale efforts in discovering human genetic variation, India's vast reservoir of genetic diversity remains largely unexplored.
To analyze an unbiased sample of genetic diversity in India and to investigate human migration history in Eurasia, we resequenced one 100-kb ENCODE region in 92 samples collected from three castes and one tribal group from the state of Andhra Pradesh in south India. Analyses of the four Indian populations, along with eight HapMap populations (692 samples), showed that 30% of all SNPs in the south Indian populations are not seen in HapMap populations. Several Indian populations, such as the Yadava, Mala/Madiga, and Irula, have nucleotide diversity levels as high as those of HapMap African populations. Using unbiased allele-frequency spectra, we investigated the expansion of human populations into Eurasia. The divergence time estimates among the major population groups suggest that Eurasian populations in this study diverged from Africans during the same time frame (approximately 90 to 110 thousand years ago). The divergence among different Eurasian populations occurred more than 40,000 years after their divergence from Africans.
Our results show that Indian populations harbor large amounts of genetic variation that has not been surveyed adequately by public SNP discovery efforts. Our data also support a delayed expansion hypothesis in which an ancestral Eurasian founding population remained isolated long after the out-of-Africa diaspora, before expanding throughout Eurasia.
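Nucleotide diversity (π) is the average number of differences per site between two sequences drawn from the sample. A minimal sketch of that definition, using hypothetical toy sequences rather than any data from the study:

```python
from itertools import combinations


def nucleotide_diversity(sequences):
    """Mean pairwise differences per site among aligned sequences."""
    if len(sequences) < 2:
        raise ValueError("need at least two sequences")
    length = len(sequences[0])
    if length == 0 or any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned, non-empty and equal length")
    pairs = list(combinations(sequences, 2))
    total_diffs = sum(sum(x != y for x, y in zip(s, t)) for s, t in pairs)
    return total_diffs / (len(pairs) * length)


# Toy alignment of three sequences over four sites.
toy = ["AAAA", "AAAT", "AATT"]
print(f"pi = {nucleotide_diversity(toy):.4f}")
```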
Massively parallel sequencing readouts of epigenomic assays are enabling integrative genome-wide analyses of genomic and epigenomic variation. Pash 3.0 performs sequence comparison and read mapping and can be employed as a module within diverse configurable analysis pipelines, including ChIP-Seq and methylome mapping by whole-genome bisulfite sequencing.
Pash 3.0 generally matches the accuracy and speed of niche programs for fast mapping of short reads, and exceeds their performance on longer reads generated by a new generation of massively parallel sequencing technologies. By exploiting longer read lengths, Pash 3.0 maps reads onto the large fraction of genomic DNA that contains repetitive elements and polymorphic sites, including indel polymorphisms.
We demonstrate the versatility of Pash 3.0 by analyzing the interaction between CpG methylation, CpG SNPs, and imprinting based on publicly available whole-genome shotgun bisulfite sequencing data. Pash 3.0 makes use of gapped k-mer alignment, a non-seed based comparison method, which is implemented using multi-positional hash tables. This allows Pash 3.0 to run on diverse hardware platforms, including individual computers with standard RAM capacity, multi-core hardware architectures and large clusters.
Historical episodes of natural selection can skew the frequencies of genetic variants, leaving a signature that can persist for many tens or even hundreds of thousands of years. However, formal tests for selection based on allele frequency skew require strong assumptions about demographic history and mutation, which are rarely well understood. Here, we develop an empirical approach to test for signals of selection that compares patterns of genetic variation at a candidate locus with matched random regions of the genome collected in the same way. We apply this approach to four genes that have been implicated in syndromes of impaired neurological development, comparing the pattern of variation in our re-sequencing data with a large-scale, genomic data set that provides an empirical null distribution. We confirm a previously reported signal at FOXP2, and find a novel signal of selection centered at AHI1, a gene that is involved in motor and behavior abnormalities. The locus is marked by many high frequency derived alleles in non-Africans that are of low frequency in Africans, suggesting that selection at this or a closely neighboring gene occurred in the ancestral population of non-Africans. Our study also provides a prototype for how empirical scans for ancient selection can be carried out once many genomes are sequenced.
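The core of such an empirical comparison is ranking the candidate locus against the statistics observed at matched random regions. The sketch below shows one common way to turn that ranking into an empirical p-value; the add-one correction and the toy values are assumptions for illustration, and the abstract does not specify the exact statistic or formula used.

```python
def empirical_p_value(observed, null_values):
    """One-sided empirical p-value against an empirical null distribution.

    Counts null regions whose statistic is at least as extreme as the
    candidate locus; the +1 terms keep the estimate away from zero.
    """
    if not null_values:
        raise ValueError("null distribution must not be empty")
    at_least_as_extreme = sum(1 for value in null_values if value >= observed)
    return (at_least_as_extreme + 1) / (len(null_values) + 1)


# Hypothetical statistic at a candidate locus and at matched random regions.
null_stats = [0.2, 0.5, 0.9, 1.1, 1.4, 1.8, 2.0, 2.3, 2.9]
print(empirical_p_value(3.5, null_stats))
```

Because the null regions are collected in the same way as the candidate, demographic history affects both equally, which is the motivation the abstract gives for the empirical design.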
A region of approximately one megabase of human Chromosome 12 shows extensive linkage disequilibrium in Utah residents with ancestry from northern and western Europe. This strikingly large linkage disequilibrium block was analyzed with statistical and experimental methods to determine whether natural selection could be implicated in shaping the current genome structure. Extended Haplotype Homozygosity and Relative Extended Haplotype Homozygosity analyses on this region mapped a core region of the strongest conserved haplotype to exon 1 of the Spinocerebellar ataxia type 2 gene (SCA2). Direct DNA sequencing of this region of the SCA2 gene revealed a significant association between a pre-expanded allele [(CAG)8CAA(CAG)4CAA(CAG)8] of CAG repeats within exon 1 and the selected haplotype of the SCA2 gene. A significantly negative Tajima's D value (−2.20, p < 0.01) at this site was consistent with selection on the CAG repeat. This region was also investigated in three other populations, none of which showed signs of selection. These results suggest that recent positive selection of the pre-expansion SCA2 CAG repeat has occurred in Utah residents with European ancestry.
Natural selection ultimately acts on the genetic variants existing among human populations. Therefore, there are “footprints” that the selective force has left behind in the human genome. In this study, Yu et al. identified an extremely large region on Chromosome 12 that is under positive selection in Utah residents with European ancestry by characterizing the correlation patterns of genomic variants. Further analyses on this interval suggested that selection centered on one of the many forms of the Spinocerebellar ataxia type-2 (SCA2) gene. The selected form was next shown to be associated with one short version of the disease-causing CAG repeat in the SCA2 gene. These results suggest that the CAG repeat was positively selected. An abnormally long version of the CAG repeat can cause SCA2, a neurodegenerative disease that severely impairs body movement. The authors showed how they unraveled natural selection acting on the SCA2 gene. Their findings might lead to the discovery of the biological functions of this gene and its CAG repeat. This kind of study holds potential to facilitate the finding of common disease genes.
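Tajima's D compares two estimates of diversity: the mean number of pairwise differences and the number of segregating sites scaled by sample size. An excess of rare variants gives a negative value. The sketch below implements the standard formula on 0/1 haplotypes; the toy data are hypothetical and unrelated to the SCA2 sequences, and significance testing is not shown.

```python
from math import sqrt


def tajimas_d(haplotypes):
    """Tajima's D from equal-length 0/1 haplotypes (one row per chromosome)."""
    n = len(haplotypes)
    if n < 4:
        raise ValueError("need at least 4 haplotypes for a non-zero variance")
    counts = [sum(site) for site in zip(*haplotypes)]
    segregating = [k for k in counts if 0 < k < n]
    s = len(segregating)
    if s == 0:
        raise ValueError("no segregating sites")
    # Mean number of pairwise differences, summed over segregating sites.
    pi = sum(2 * k * (n - k) for k in segregating) / (n * (n - 1))
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    return (pi - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))


# Toy samples of six haplotypes over three sites.
rare_only = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 0], [0, 0, 0], [0, 0, 0]]
intermediate = [[1, 1, 1]] * 3 + [[0, 0, 0]] * 3
print(tajimas_d(rare_only), tajimas_d(intermediate))
```

In the first toy sample every variant is a singleton, so D is negative; in the second every variant is at 50% frequency, so D is positive.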