Exome sequencing is emerging as a popular approach to study the effect of rare coding variants on complex phenotypes. The promise of exome sequencing is grounded in theoretical population genetics and in empirical successes of candidate gene sequencing studies. Many projects aimed at common diseases are underway, and their results are eagerly anticipated. In this Perspective, using exome sequencing data from 438 individuals, we discuss several aspects of exome sequencing studies that we view as particularly important. We review processing and quality control of raw sequence data, evaluate the statistical properties of exome sequencing studies, discuss rare variant burden tests to detect association to phenotypes, and demonstrate the importance of accounting for population stratification in the analysis of rare variants. We conclude that enthusiasm for exome sequencing studies of complex traits should be combined with the caution that thousands of samples may be required to reach sufficient statistical power.
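The rare variant burden tests mentioned above can be illustrated with a minimal sketch. It assumes the simplest collapsing design, in which each individual is scored as a carrier or non-carrier of any rare allele in a gene and carrier counts are compared between cases and controls with a one-sided Fisher exact test. The function names and the genotype coding (minor-allele counts 0/1/2) are illustrative assumptions, not the method used in the Perspective.

```python
from math import comb


def is_carrier(genotypes, rare_sites):
    """True if the individual has at least one minor allele at any rare site.

    genotypes: sequence of minor-allele counts (0, 1 or 2) per site.
    rare_sites: indices of the sites considered rare (an assumed input).
    """
    return any(genotypes[i] > 0 for i in rare_sites)


def burden_fisher_p(case_carriers, n_cases, control_carriers, n_controls):
    """One-sided Fisher exact p-value for an excess of carriers among cases."""
    total = n_cases + n_controls
    carriers = case_carriers + control_carriers
    upper = min(carriers, n_cases)
    # Hypergeometric upper tail: probability of seeing at least
    # `case_carriers` carriers among the cases by chance.
    tail = sum(
        comb(carriers, k) * comb(total - carriers, n_cases - k)
        for k in range(case_carriers, upper + 1)
    )
    return tail / comb(total, n_cases)
```

With only a handful of carriers per gene the p-value cannot become small, which is one way to see why large sample sizes are needed for adequate power.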
Summary: Genetic association studies making use of high-throughput genotyping arrays need to process large amounts of data in the order of millions of markers per experiment. The first step of any analysis with genotyping arrays is typically the conduct of a thorough data clean up and quality control to remove poor quality genotypes and generate metrics to inform and select individuals for downstream statistical analysis. We have developed pyGenClean, a bioinformatics tool to facilitate and standardize the genetic data clean up pipeline with genotyping array data. In conjunction with a source batch-queuing system, the tool minimizes data manipulation errors, accelerates the completion of the data clean up process and provides informative plots and metrics to guide decision making for statistical analysis.
Availability and implementation:
pyGenClean is open source Python 2.7 software and is freely available, along with documentation and examples, from http://www.statgen.org.
Motivation: Despite the prevalence of copy number variation (CNV) in the human genome, only a handful of confirmed associations have been reported between common CNVs and complex disease. This may be partially attributed to the difficulty in accurately genotyping CNVs in large cohorts using array-based technologies. Exome sequencing is now widely being applied to case–control cohorts and presents an exciting opportunity to look for common CNVs associated with disease.
Results: We developed ExoCNVTest: an exome sequencing analysis pipeline to identify disease-associated CNVs and to generate absolute copy number genotypes at putatively associated loci. Our method re-discovered the LCE3B_LCE3C CNV association with psoriasis (P-value = 5 × 10e−6) while controlling inflation of test statistics (λ < 1). ExoCNVTest-derived absolute CNV genotypes were 97.4% concordant with PCR-derived genotypes at this locus.
Availability and implementation: ExoCNVTest has been implemented in Java and R and is freely available from www1.imperial.ac.uk/medicine/people/l.coin/.
Contact: Lachlan.J.M.Coin@genomics.org.cn
Many exome sequencing studies of Mendelian disorders fail to optimally exploit family information. Classical genetic linkage analysis is an effective method for eliminating a large fraction of the candidate causal variants discovered, even in small families that lack a unique linkage peak. We demonstrate that accurate genetic linkage mapping can be performed using SNP genotypes extracted from exome data, removing the need for separate array-based genotyping. We provide software to facilitate such analyses.
Genotyping arrays are a cost-effective approach when typing previously-identified genetic polymorphisms in large numbers of samples. One limitation of genotyping arrays with rare variants (e.g., minor allele frequency [MAF] <0.01) is the difficulty that automated clustering algorithms have in accurately detecting and assigning genotype calls. Combining intensity data from large numbers of samples may increase the ability to accurately call the genotypes of rare variants. Approximately 62,000 ethnically diverse samples from eleven Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium cohorts were genotyped with the Illumina HumanExome BeadChip across seven genotyping centers. The raw data files for the samples were assembled into a single project for joint calling. To assess the quality of the joint calling, concordance of genotypes in a subset of individuals having both exome chip and exome sequence data was analyzed. After exclusion of low performing SNPs on the exome chip and non-overlap of SNPs derived from sequence data, genotypes of 185,119 variants (11,356 were monomorphic) were compared in 530 individuals that had whole exome sequence data. A total of 98,113,070 pairs of genotypes were tested and 99.77% were concordant, 0.14% had missing data, and 0.09% were discordant. We report that joint calling enables accurate genotyping of rare variation using array technology when large sample sizes are available and best practices are followed. The cluster file from this experiment is available at www.chargeconsortium.com/main/exomechip.
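The concordance figures above can be reproduced in outline with a short sketch. It assumes genotypes are coded as alternate-allele counts (0, 1, 2) with `None` for a missing call; the function name and this coding are illustrative and not taken from the CHARGE analysis.

```python
from collections import Counter


def concordance_summary(chip_calls, seq_calls):
    """Fractions of concordant, missing and discordant genotype pairs."""
    counts = Counter()
    for chip, seq in zip(chip_calls, seq_calls):
        if chip is None or seq is None:
            counts["missing"] += 1
        elif chip == seq:
            counts["concordant"] += 1
        else:
            counts["discordant"] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no genotype pairs to compare")
    return {k: counts[k] / total for k in ("concordant", "missing", "discordant")}


# The number of genotype pairs compared is variants x individuals.
n_pairs = 185_119 * 530
```

The product of the 185,119 variants and 530 individuals gives the 98,113,070 genotype pairs reported in the abstract.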
Motivation: Exome sequencing has proven to be an effective tool to discover the genetic basis of Mendelian disorders. It is well established that copy number variants (CNVs) contribute to the etiology of these disorders. However, calling CNVs from exome sequence data is challenging. A typical read depth strategy consists of using another sample (or a combination of samples) as a reference to control for the variability at the capture and sequencing steps. However, technical variability between samples complicates the analysis and can create spurious CNV calls.
Results: Here, we introduce ExomeDepth, a new CNV calling algorithm designed to control for this technical variability. ExomeDepth uses a robust model for the read count data and uses this model to build an optimized reference set in order to maximize the power to detect CNVs. As a result, ExomeDepth is effective across a wider range of exome datasets than the previously existing tools, even for small (e.g. one to two exons) and heterozygous deletions. We used this new approach to analyse exome data from 24 patients with primary immunodeficiencies. Depending on data quality and the exact target region, we find between 170 and 250 exonic CNV calls per sample. Our analysis identified two novel causative deletions in the genes GATA2 and DOCK8.
Availability: The code used in this analysis has been implemented into an R package called ExomeDepth and is available at the Comprehensive R Archive Network (CRAN).
Supplementary data are available at Bioinformatics online.
Recent advances in next-generation sequencing technologies have transformed the genetic study of human diseases; this is an era of unprecedented productivity. Exome sequencing, the targeted sequencing of the protein-coding portion of the human genome, has been shown to be a powerful and cost-effective method for detection of disease variants underlying Mendelian disorders. Increasing effort has been made in the interest of the identification of rare variants associated with complex traits in sequencing studies. Here we provide an overview of the application fields for exome sequencing in human diseases. We describe a general framework of computation and bioinformatics for handling sequencing data. We then demonstrate data quality and agreement between exome sequencing and exome microarray (chip) genotypes using data collected on the same set of subjects in a genetic study of panic disorder. Our results show that, in sequencing data, the data quality was generally higher for variants within the exonic target regions, compared to that outside the target regions, due to the target enrichment. We also compared genotype concordance for variant calls obtained by exome sequencing vs. exome genotyping microarrays. The overall consistency rate was >99.83% and the heterozygous consistency rate was >97.55%. The two platforms share a large amount of agreement over low frequency variants in the exonic regions, while exome sequencing provides much more information on variants not included on exome genotyping microarrays. The results demonstrate that exome sequencing data are of high quality and can be used to investigate the role of rare coding variants in human diseases.
exome sequencing; exome arrays; Mendelian diseases; complex traits; whole-genome sequencing
Next generation sequencing (NGS) has been leading the genetic study of human disease into an era of unprecedented productivity. Many bioinformatics pipelines have been developed to call variants from NGS data. The performance of these pipelines depends crucially on the variant caller used and on the calling strategies implemented. We studied the performance of four prevailing callers, SAMtools, GATK, glftools and Atlas2, using single-sample and multiple-sample variant-calling strategies. Using the same aligner, BWA, we built four single-sample and three multiple-sample calling pipelines and applied the pipelines to whole exome sequencing data taken from 20 individuals. We obtained genotypes generated by Illumina Infinium HumanExome v1.1 Beadchip for validation analysis and then used Sanger sequencing as a “gold-standard” method to resolve discrepancies for selected regions of high discordance. Finally, we compared the sensitivity of three of the single-sample calling pipelines using known simulated whole genome sequence data as a gold standard. Overall, for single-sample calling, the called variants were highly consistent across callers and the pairwise overlapping rate was about 0.9. Compared with other callers, GATK had the highest rediscovery rate (0.9969) and specificity (0.99996), and the Ti/Tv ratio out of GATK was closest to the expected value of 3.02. Multiple-sample calling increased the sensitivity. Results from the simulated data suggested that GATK outperformed SAMtools and glfSingle in sensitivity, especially for low coverage data. Further, for the selected discrepant regions evaluated by Sanger sequencing, variant genotypes called by exome sequencing versus the exome array were more accurate, although the average variant sensitivity and overall genotype consistency rate were as high as 95.87% and 99.82%, respectively. In conclusion, GATK showed several advantages over other variant callers for general purpose NGS analyses. 
The GATK pipelines we developed perform very well.
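The Ti/Tv ratio used above as a quality indicator is the number of transitions (purine to purine or pyrimidine to pyrimidine changes) divided by the number of transversions. A minimal sketch, assuming single nucleotide variants are given as `(ref, alt)` base pairs; the function names are illustrative and not part of any of the callers compared.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True for A<->G or C<->T substitutions."""
    pair = {ref.upper(), alt.upper()}
    return len(pair) == 2 and (pair <= PURINES or pair <= PYRIMIDINES)


def ti_tv_ratio(snvs):
    """Transition/transversion ratio over an iterable of (ref, alt) pairs."""
    ti = tv = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        # Skip anything that is not a simple single-base substitution.
        if ref == alt or ref not in "ACGT" or alt not in "ACGT" \
                or len(ref) != 1 or len(alt) != 1:
            continue
        if is_transition(ref, alt):
            ti += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions observed; ratio undefined")
    return ti / tv
```

A call set whose ratio falls well below the expected value is often taken as a sign of excess false positives, since random errors favour transversions.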
Filter metrics are used as a quick assessment of sequence trace files in order to sort data into different categories, i.e. High Quality, Review, and Low Quality, without human intervention. The filter metrics consist of two numerical parameters for sequence quality assessment: trace score (TS) and contiguous read length (CRL). Primer specific settings for the TS and CRL were established using a calibration dataset of 2817 traces and validated using a concordance dataset of 5617 traces. Prior to optimization, 57% of the traces required manual review before import into a sequence analysis program, whereas after optimization only 28% of the traces required manual review. After optimization of primer specific filter metrics for mitochondrial DNA sequence data, an overall reduction of review of trace files translates into increased throughput of data analysis and decreased time required for manual review.
Filter metrics; expert systems; trace score; contiguous read length; quality assessment
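The three-way sorting of traces described above can be sketched as a simple rule on the two metrics. The threshold values below are hypothetical placeholders chosen for illustration; the abstract does not report the primer-specific settings, and the rule combining TS and CRL is likewise an assumption.

```python
def classify_trace(trace_score, crl,
                   ts_pass=35, ts_review=20,
                   crl_pass=400, crl_review=200):
    """Sort a trace into High Quality, Review or Low Quality.

    trace_score: trace score (TS) for the read.
    crl: contiguous read length in bases.
    All four thresholds are hypothetical defaults, not published values.
    """
    if trace_score >= ts_pass and crl >= crl_pass:
        return "High Quality"
    if trace_score >= ts_review and crl >= crl_review:
        return "Review"
    return "Low Quality"
```

In practice one set of thresholds would be stored per primer, since the abstract describes the settings as primer specific.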
The detection of genomic segments that are identical by descent (IBD) in genome-wide association studies has proven successful in pinpointing genetic relatedness between reportedly unrelated individuals and leveraging such regions to shortlist candidate genes. These techniques depend on high-density genotyping arrays and their effectiveness in diverse sequence data is largely unknown. Due to decreasing costs and increasing effectiveness of high throughput techniques for whole-exome sequencing, an influx of exome sequencing data has become available. Studies using exomes and IBD-detection methods within known pedigrees have shown that IBD can be useful in finding hidden genetic candidates where known relatives are available. We set out to examine the viability of using IBD-detection in whole exome sequencing data in population-wide studies. In doing so, we extend GERMLINE, a method to detect IBD from exome sequencing data by finding small slices of matching alleles between pairs of individuals and extending them into full IBD segments. This algorithm allows for efficient population-wide detection in dense data. We apply this algorithm to a cohort of Crohn's Disease cases where whole-exome and GWAS array data are available. We confirm that GWAS-based detected segments are highly accurate and predictive of underlying shared variation. Where segments inferred from GWAS are expected to be of high accuracy, we compare exome-based detection accuracy of multiple detection strategies. We find detection accuracy to be prohibitively low in all assessments, both in terms of segment sensitivity and specificity. Even after isolating relatively long segments beyond 10cM, exome-based detection continued to offer poor specificity/sensitivity tradeoffs. We hypothesize that the variable coverage and platform biases of exome capture account for this decreased accuracy and look toward whole genome sequencing data as a higher quality source for detecting population-wide IBD.
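The slice-and-extend idea described above can be illustrated with a toy sketch. It assumes phased haplotypes given as equal-length allele strings, requires exact matches within each slice, and joins runs of consecutive matching slices into segments. This is a simplification for illustration only: it is not the GERMLINE implementation, and the function name, slice length and minimum-run parameter are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def shared_segments(haplotypes, slice_len=4, min_slices=2):
    """Find candidate shared segments between all pairs of haplotypes.

    haplotypes: dict mapping a name to a string of alleles (equal lengths).
    Returns {(name_a, name_b): [(start, end), ...]} in marker indices,
    with `end` exclusive.
    """
    n_markers = len(next(iter(haplotypes.values())))
    n_slices = n_markers // slice_len

    # Step 1: hash each slice so identical slices land in the same bucket.
    matches = defaultdict(list)
    for s in range(n_slices):
        lo, hi = s * slice_len, (s + 1) * slice_len
        buckets = defaultdict(list)
        for name, hap in haplotypes.items():
            buckets[hap[lo:hi]].append(name)
        for names in buckets.values():
            for a, b in combinations(sorted(names), 2):
                matches[(a, b)].append(s)

    # Step 2: extend by merging runs of consecutive matching slices.
    segments = {}
    for pair, slices in matches.items():
        runs = []
        start = prev = slices[0]
        for s in slices[1:]:
            if s == prev + 1:
                prev = s
            else:
                runs.append((start, prev))
                start = prev = s
        runs.append((start, prev))
        kept = [(a * slice_len, (b + 1) * slice_len)
                for a, b in runs if b - a + 1 >= min_slices]
        if kept:
            segments[pair] = kept
    return segments
```

Hashing slices avoids comparing every pair of individuals at every marker, which is what makes population-wide detection tractable in dense data.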
Isolated hypoparathyroidism (IH) shows heterogeneous phenotypes and can be caused by defects in a variety of genes. The goal of our study was to determine the clinical features and to analyze gene mutations in a large cohort of Korean patients with sporadic or familial IH. We recruited 23 patients. They showed a broad range of onset age and various values of biochemical data. Whole exome sequencing was performed on two affected cases and one unaffected individual in a family. All coding exons and exon-intron borders of GCMB, CASR, and prepro-PTH were sequenced using PCR-amplified DNA. In the one family that underwent whole exome sequencing analysis, approximately 300 single nucleotide changes emerged as candidates for genetic alteration. Among them, we identified a functional mutation in exon 2 of GCMB (C106R) in two affected cases. In addition, heterozygous gain-of-function mutations in the CASR gene, D410E and P221L, were found in other subjects. We also found one single nucleotide polymorphism (SNP) in the prepro-PTH gene, five SNPs in the CASR gene, and four SNPs in the GCMB gene. The current study represents a variety of biochemical phenotypes in IH patients with the molecular genetic diagnosis of IH.
CASR; GCMB; Hypocalcemia; Hypoparathyroidism; Prepro-PTH
Multiple genes have been implicated by association studies in altering inflammatory bowel disease (IBD) predisposition. Paediatric patients often manifest more extensive disease and a particularly severe disease course. It is likely that genetic predisposition plays a more substantial role in this group.
To identify the spectrum of rare and novel variation in known IBD susceptibility genes using exome sequencing analysis in eight individual cases of childhood onset severe disease.
DNA samples from the eight patients underwent targeted exome capture and sequencing. Data were processed through an analytical pipeline to align sequence reads, conduct quality checks, and identify and annotate variants where patient sequence differed from the reference sequence. For each patient, the entire complement of rare variation within strongly associated candidate genes was catalogued.
Across the panel of 169 known IBD susceptibility genes, approximately 300 variants in 104 genes were found. Excluding splicing and HLA-class variants, 58 variants across 39 of these genes were classified as rare, with an alternative allele frequency of <5%, of which 17 were novel. Only two patients with early onset Crohn's disease exhibited rare deleterious variations within NOD2: the previously described R702W variant was the sole NOD2 variant in one patient, while the second patient also carried the L1007 frameshift insertion. Both patients harboured other potentially damaging mutations in the GSDMB, ERAP2 and SEC16A genes. The two patients severely affected with ulcerative colitis exhibited a distinct profile: both carried potentially detrimental variation in the BACH2 and IL10 genes not seen in other patients.
For each of the eight individuals studied, all non-synonymous, truncating and frameshift mutations across all known IBD genes were identified. A unique profile of rare and potentially damaging variants was evident for each patient with this complex disease.
IBD-genetics; inflammatory bowel disease; Crohn's disease; paediatric gastroenterology; ulcerative colitis; Zollinger-Ellison syndrome
Whole exome capture sequencing allows researchers to cost-effectively sequence the coding regions of the genome. Although the exome capture sequencing methods have become routine and well established, there is currently a lack of tools specialized for variant calling in this type of data.
Using statistical models trained on validated whole-exome capture sequencing data, the Atlas2 Suite is an integrative variant analysis pipeline optimized for variant discovery on all three of the widely used next generation sequencing platforms (SOLiD, Illumina, and Roche 454). The suite employs logistic regression models in conjunction with user-adjustable cutoffs to accurately separate true SNPs and INDELs from sequencing and mapping errors with high sensitivity (96.7%).
We have implemented the Atlas2 Suite and applied it to 92 whole exome samples from the 1000 Genomes Project. The Atlas2 Suite is available for download at http://sourceforge.net/projects/atlas2/. In addition to a command line version, the suite has been integrated into the Genboree Workbench, allowing biomedical scientists with minimal informatics expertise to remotely call, view, and further analyze variants through a simple web interface. The existing genomic databases displayed via the Genboree browser also streamline the process from variant discovery to functional genomics analysis, resulting in an off-the-shelf toolkit for the broader community.
We introduce a simple and yet scientifically objective criterion for identifying SNPs with genotyping errors due to poor clustering. This yields a metric for assessing the stability of the assigned genotypes by evaluating the extent of discordance between the calls made with the unperturbed and perturbed intensities. The efficacy of the metric is evaluated by: (1) estimating the extent of over-dispersion of the Hardy-Weinberg equilibrium chi-square test statistics; (2) an interim case-control study, where we investigated the efficacy of the introduced metric and standard quality control filters in reducing the number of SNPs with evidence of phenotypic association which are attributed to genotyping errors; (3) investigating the call and concordance rates of SNPs identified by perturbation analysis which have been genotyped on both Affymetrix and Illumina platforms. Removing SNPs identified by the extent of discordance can reduce the degree of over-dispersion of the HWE test statistic. Sensible use of perturbation analysis in an association study can correctly identify SNPs with problematic genotyping, reducing the number required for visual inspection. SNPs identified by perturbation analysis had lower call and concordance rates, and removal of these SNPs significantly improved the performance for the remaining SNPs.
genotyping errors; genome-wide association studies; genotype calling
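The Hardy-Weinberg equilibrium (HWE) chi-square statistic and its over-dispersion, used above to evaluate the metric, can be sketched as follows. The sketch assumes a biallelic SNP with observed genotype counts and uses the standard 1-degree-of-freedom chi-square test; the inflation factor is the median statistic divided by the median of that distribution (about 0.455). Function names are illustrative.

```python
import math
from statistics import median

CHI2_1DF_MEDIAN = 0.4549364  # median of the chi-square distribution, 1 df


def hwe_chi_square(n_aa, n_ab, n_bb):
    """HWE chi-square statistic and p-value from genotype counts."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes observed")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of chi-square with 1 df.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


def inflation_factor(stats):
    """Over-dispersion of a set of 1-df chi-square statistics."""
    return median(stats) / CHI2_1DF_MEDIAN
```

An inflation factor well above 1 across many SNPs suggests systematic problems such as poor clustering, rather than a few isolated departures from HWE.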
To perform initial DNA analysis of four selected early mediaeval individuals from the Zvonimirovo burial site in Northern Croatia.
Investigation of genetic matching of individuals from a “double burial” and of individuals with shared cranial non-metric/metric traits from 2 single inhumations, located in another block of the cemetery complex, was carried out. DNA from four teeth samples was extracted, quantified, and amplified by polymerase chain reaction (PCR) for short tandem repeat loci, using AmpFlSTR Profiler™ PCR Amplification Kit.
Autosomal short tandem repeat (STR) genotyping generated high parentage probability (PP) as to the matching of the 2 individuals from the “double burial” (PP 98.63%), and of 2 women with shared cranial non-metric/metric traits from neighboring single burials (PP 90.07%). Parentage probability calculations of a possible genetic matching of the subadult from a “double burial” with the adults from single burials 4 and 3 were significantly lower (PP 60.45% and 38.52%). DNA typing for amelogenin confirmed the sex of the 3 female individuals, estimated previously by morphology. The unknown sex of the subadult was also determined as female.
Increased parentage probability for autosomal STR loci matches and the presence of a rare allele shared among matched individuals support their possible kinship relationship, in accordance with bioarchaeological data. We assume an intentional double burial based on a close familial relationship, ie 2 single neighboring inhumations based on consanguinity, rather than a strong social relationship. The kinship lineages remain unknown at this point.
Myopia is the most common ocular disorder worldwide, and high myopia in particular is one of the leading causes of blindness. Genetic factors play a critical role in the development of myopia, especially high myopia. Recently, the exome sequencing approach has been successfully used for the disease gene identification of Mendelian disorders. Here we show a successful application of exome sequencing to identify a gene for an autosomal dominant disorder, and we have identified a gene potentially responsible for high myopia in a monogenic form. We captured exomes of two affected individuals from a Han Chinese family with high myopia and performed sequencing analysis by a second-generation sequencer with a mean coverage of 30× and sufficient depth to call variants at ∼97% of each targeted exome. The shared genetic variants of these two affected individuals in the family being studied were filtered against the 1000 Genomes Project and the dbSNP131 database. A mutation A672G in zinc finger protein 644 isoform 1 (ZNF644) was identified as being related to the phenotype of this family. After we performed sequencing analysis of the exons in the ZNF644 gene in 300 sporadic cases of high myopia, we identified an additional five mutations (I587V, R680G, C699Y, 3′UTR+12 C>G, and 3′UTR+592 G>A) in 11 different patients. All these mutations were absent in 600 normal controls. The ZNF644 gene was expressed in human retinal and retinal pigment epithelium (RPE). Given that ZNF644 is predicted to be a transcription factor that may regulate genes involved in eye development, mutation may cause the axial elongation of eyeball found in high myopia patients. Our results suggest that ZNF644 might be a causal gene for high myopia in a monogenic form.
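The filtering step described above, keeping variants shared by the affected relatives and discarding those already catalogued in public databases, amounts to a set intersection followed by a set difference. A minimal sketch, assuming each variant is represented as a `(chromosome, position, ref, alt)` tuple; the function name and this representation are illustrative, and the example coordinates are made up.

```python
def candidate_variants(affected_calls, known_variants):
    """Variants present in every affected individual and absent from databases.

    affected_calls: list of per-individual variant collections.
    known_variants: variants already reported (e.g. in a public catalogue).
    """
    if not affected_calls:
        return set()
    shared = set.intersection(*(set(calls) for calls in affected_calls))
    return shared - set(known_variants)


# Hypothetical example data, not from the study.
person_1 = [("1", 100, "A", "G"), ("1", 200, "C", "T"), ("2", 50, "G", "A")]
person_2 = [("1", 100, "A", "G"), ("1", 200, "C", "T"), ("3", 75, "T", "C")]
catalogued = [("1", 200, "C", "T")]
candidates = candidate_variants([person_1, person_2], catalogued)
```

Here `candidates` contains only the variant at position 100 on chromosome 1, the one call that is shared and not catalogued.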
People with myopia see near objects more clearly than objects far away. Myopia is the most common ocular disorder worldwide, with a high prevalence in Asian (40%–70%) and Caucasian (20%–30%) populations. Although the etiologies of myopia have not yet been established, previous studies have indicated the involvement of genetic and environmental factors (such as close working habits, higher education levels, and higher socioeconomic class). Genetic factors play a critical role in the development of myopia, especially high myopia. In this study, we use exome sequencing, a powerful tool for disease gene identification, to identify a gene involved in high myopia in a monogenic form among Han Chinese. Mutations in zinc finger protein 644 isoform 1 (ZNF644) were identified as potentially responsible for the phenotype of high myopia. The main feature of high myopia is axial elongation of the eye globe. Given that ZNF644 is predicted to be a transcription factor that may regulate genes involved in eye development, a mutant ZNF644 protein may impact normal eye development and therefore may underlie the axial elongation of the eye globe in high myopia patients. Further study of the biological function of ZNF644 will provide insight into the pathogenesis of myopia.
Summary: The assessment of data quality is a major concern in microarray analysis. arrayQualityMetrics is a Bioconductor package that provides a report with diagnostic plots for one or two colour microarray data. The quality metrics assess reproducibility, identify apparent outlier arrays and compute measures of signal-to-noise ratio. The tool handles most current microarray technologies and is amenable to use in automated analysis pipelines or for automatic report generation, as well as for use by individuals. The diagnosis of quality remains, in principle, a context-dependent judgement, but our tool provides powerful, automated, objective and comprehensive instruments on which to base a decision.
Availability: arrayQualityMetrics is a free and open source package, under LGPL license, available from the Bioconductor project at www.bioconductor.org. A user's guide and examples are provided with the package. Some examples of HTML reports generated by arrayQualityMetrics can be found at http://www.microarray-quality.org
Supplementary information: Supplementary data are available at Bioinformatics online.
DNA from buccal brush samples is being used for high-throughput analyses in a variety of applications, but the impact of sample type on genotyping success and downstream statistical analysis remains unclear. The objective of the current study was to determine laboratory predictors of genotyping failure among buccal DNA samples, and to evaluate the successfully genotyped results with respect to analytic quality control metrics. Sample and genotyping characteristics were compared between buccal and blood samples collected in the population-based Genetic and Environmental Risk Factors for Hemorrhagic Stroke (GERFHS) study (https://gerfhs.phs.wfubmc.edu/public/index.cfm).
Seven-hundred eight (708) buccal and 142 blood DNA samples were analyzed for laboratory-based and analysis metrics. Overall genotyping failure rates were not statistically different between buccal (11.3%) and blood (7.0%, p = 0.18) samples; however, both the Contrast Quality Control (cQC) rate and the dynamic model (DM) call rates were lower among buccal DNA samples (p < 0.0001). The ratio of double-stranded to total DNA (ds/total ratio) in the buccal samples was the only laboratory characteristic predicting sample success (p < 0.0001). A threshold of at least 34% ds/total DNA provided specificity of 98.7% with a 90.5% negative predictive value for eliminating probable failures. After genotyping, median sample call rates (99.1% vs. 99.4%, p < 0.0001) and heterozygosity rates (25.6% vs. 25.7%, p = 0.006) were lower for buccal versus blood DNA samples, respectively, but absolute differences were small. Minor allele frequency differences from HapMap were smaller for buccal than blood samples, and both sample types demonstrated tight genotyping clusters, even for rare alleles.
We identified a buccal sample characteristic, a ratio of ds/total DNA <34%, which distinguished buccal DNA samples likely to fail high-throughput genotyping. Applying this threshold, the quality of final genotyping resulting from buccal samples is somewhat lower, but compares favorably to blood. Caution is warranted if cases and controls have different sample types, but buccal samples provide comparable results to blood samples in large-scale genotyping analyses.
Buccal; Blood; DNA; Quality; Minor allele frequency (MAF); Genetic
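The specificity and negative predictive value quoted for the 34% ds/total threshold follow from a standard two-by-two table. The sketch below assumes that a "positive" prediction means a sample flagged as a probable failure (ratio below the threshold); the abstract does not state which class it treats as positive, so this orientation, the function names, and the example data are all assumptions.

```python
def confusion_counts(ratios, failed, threshold=0.34):
    """Count TP, FP, TN, FN when ratio < threshold predicts failure.

    ratios: ds/total DNA ratio per sample, as a fraction.
    failed: True where genotyping actually failed.
    """
    tp = fp = tn = fn = 0
    for ratio, did_fail in zip(ratios, failed):
        predicted_fail = ratio < threshold
        if predicted_fail and did_fail:
            tp += 1
        elif predicted_fail and not did_fail:
            fp += 1
        elif not predicted_fail and not did_fail:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def specificity(tn, fp):
    """Fraction of true negatives among all actual negatives."""
    return tn / (tn + fp)


def negative_predictive_value(tn, fn):
    """Fraction of true negatives among all negative predictions."""
    return tn / (tn + fn)
```

Applied to real laboratory data, the same three functions can be rerun over a range of thresholds to see how the two measures trade off.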
To identify rare variants contributing to multiple sclerosis (MS) susceptibility in a family we have previously reported with up to 15 individuals affected across 4 generations.
We performed exome sequencing in a subset of affected individuals to identify novel variants contributing to MS risk within this unique family. The candidate variant was genotyped in a validation cohort of 2,104 MS trio families.
Four family members with MS were sequenced and 21,583 variants were found to be shared among these individuals. Refining the variants to those 1) with a predicted loss of function and 2) present within regions of modest haplotype sharing identified 1 novel mutation (rs55762744) in the tyrosine kinase 2 (TYK2) gene. A different polymorphism within this gene has been shown to be protective in genome-wide association studies. In contrast, the TYK2 variant identified here is a novel missense mutation and was found to be present in 10/14 (72%) cases and 28/60 (47%) of the unaffected family members. Genotyping an additional 2,104 trio families showed the variant to be transmitted preferentially from heterozygous parents (transmitted 16: not transmitted 5; χ2 = 5.76, p = 0.016).
Rs55762744 is a rare variant of modest effect on MS risk affecting a subset of patients (0.8%). Within this pedigree, rs55762744 is common and appears to be a modifier of modest risk effect. Exome sequencing is a quick and cost-effective method and we show here the utility of sequencing a few cases from a single, unique family to identify a novel variant. The sequencing of additional family members or other families may help identify other variants important in MS.
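The transmission statistic reported for the trio families is consistent with the standard transmission disequilibrium test (TDT), a McNemar-type chi-square with 1 degree of freedom computed from the counts of transmitted and untransmitted alleles in heterozygous parents. A minimal sketch, assuming that this is the test used; the function name is illustrative.

```python
import math


def tdt(transmitted, not_transmitted):
    """TDT chi-square statistic (1 df) and p-value.

    transmitted / not_transmitted: counts of the variant allele passed or
    not passed from heterozygous parents to affected offspring.
    """
    b, c = transmitted, not_transmitted
    if b + c == 0:
        raise ValueError("no informative transmissions")
    stat = (b - c) ** 2 / (b + c)
    # Survival function of chi-square with 1 df.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


stat, p_value = tdt(16, 5)
```

With 16 transmissions and 5 non-transmissions the statistic is 121/21, about 5.76, and the p-value is about 0.016, matching the values in the abstract.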
Motivation: Next-generation sequencing and exome-capture technologies are currently revolutionizing the way geneticists screen for disease-causing mutations in rare Mendelian disorders. However, the identification of causal mutations is challenging due to the sheer number of variants that are identified in individual exomes. Although databases such as dbSNP or HapMap can be used to reduce the plethora of candidate genes by filtering out common variants, the remaining set of genes still remains on the order of dozens.
Results: Our algorithm uses a non-homogeneous hidden Markov model that employs local recombination rates to identify chromosomal regions that are identical by descent (IBD = 2) in children of consanguineous or non-consanguineous parents solely based on genotype data of siblings derived from high-throughput sequencing platforms. Using simulated and real exome sequence data, we show that our algorithm is able to reduce the search space for the causative disease gene to a fifth or a tenth of the entire exome.
Availability: An R script and an accompanying tutorial are available at http://compbio.charite.de/index.php/ibd2.html.
We use least absolute shrinkage and selection operator (LASSO) regression to select genetic markers and phenotypic features that are most informative with respect to a trait of interest. We compare several strategies for applying LASSO methods in risk prediction models, using the Genetic Analysis Workshop 17 exome simulation data consisting of 697 individuals with information on genotypic and phenotypic features (smoking, age, sex) in 5-fold cross-validated fashion. The cross-validated averages of the area under the receiver operating characteristic curve range from 0.45 to 0.63 for different strategies using only genotypic markers. The same values are improved to 0.69–0.87 when both genotypic and phenotypic information are used. The ability of the LASSO method to find true causal markers is limited, but the method was able to discover several common variants (e.g., FLT1) under certain conditions.
Next-generation sequencing technologies now make it possible to genotype and measure hundreds of thousands of rare genetic variations in individuals across the genome. Characterization of high-density genetic variation facilitates control of population genetic structure on a finer scale before large-scale genotyping in disease genetics studies. Population structure is a well-known, prevalent, and important factor in common variant genetic studies, but its relevance in rare variants is unclear. We perform an extensive population structure analysis using common and rare functional variants from the Genetic Analysis Workshop 17 mini-exome sequence. The analysis based on common functional variants required 388 principal components to account for 90% of the variation in population structure. However, an analysis based on rare variants required 532 significant principal components to account for similar levels of variation. Using rare variants, we detected fine-scale substructure beyond the population structure identified using common functional variants. Our results show that the level of population structure embedded in rare variant data is different from the level embedded in common variant data and that correcting for population structure is only as good as the level one wishes to correct.
The molecular diagnosis of muscle disorders is challenging: genetic heterogeneity (>100 causal genes for skeletal and cardiac muscle disease) precludes exhaustive clinical testing, prioritizing sequencing of specific genes is difficult due to the similarity of clinical presentation, and the number of variants returned through exome sequencing can make the identification of the disease-causing variant difficult. We have filtered variants found through exome sequencing by prioritizing variants in genes known to be involved in muscle disease while examining the quality and depth of coverage of those genes. We ascertained two families with autosomal dominant limb-girdle muscular dystrophy of unknown etiology. To identify the causal mutations in these families, we performed exome sequencing on five affected individuals using the Agilent SureSelect Human All Exon 50 Mb kit and the Illumina HiSeq 2000 (2×100 bp). We identified causative mutations in desmin (IVS3+3A>G) and filamin C (p.W2710X), and augmented the phenotype data for individuals with muscular dystrophy due to these mutations. We also discuss challenges encountered due to depth of coverage variability at specific sites and the annotation of a functionally proven splice site variant as an intronic variant.
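The filtering idea described above, restricting attention to known muscle disease genes while keeping an eye on coverage, can be sketched as follows. The record layout, field names, and thresholds are assumptions chosen for the example, and the depth and quality values are invented toy numbers, not data from the families studied.

```python
def prioritize_variants(variants, known_genes, min_depth=10, min_qual=30.0):
    """Split variants in known disease genes into confident calls and calls needing review.

    Variants outside `known_genes` are dropped. Variants inside the gene list
    but below the depth or quality thresholds are kept separately, so that
    poorly covered sites are flagged rather than silently discarded.
    """
    confident, needs_review = [], []
    for variant in variants:
        if variant["gene"] not in known_genes:
            continue
        if variant["depth"] >= min_depth and variant["qual"] >= min_qual:
            confident.append(variant)
        else:
            needs_review.append(variant)
    return confident, needs_review

# Toy records (hypothetical values).
variants = [
    {"gene": "DES", "depth": 42, "qual": 99.0},
    {"gene": "FLNC", "depth": 6, "qual": 55.0},
    {"gene": "BRCA2", "depth": 80, "qual": 99.0},
]
known_muscle_genes = {"DES", "FLNC"}
confident, needs_review = prioritize_variants(variants, known_muscle_genes)
```

Keeping a separate review list reflects the coverage variability noted in the text: a low-depth call in a relevant gene may still be the causal variant and is worth manual inspection.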
We have developed an integrated strategy for targeted resequencing and variant analysis of gene subsets from the human exome. Our capture technology is geared towards resequencing gene subsets substantially larger than can be handled efficiently with simplex or multiplex PCR but smaller in scale than exome sequencing. We describe all the steps from the initial capture assay to single nucleotide variant (SNV) discovery. The capture methodology uses in-solution 80-mer oligonucleotides. To provide optimal flexibility in choosing human gene targets, we designed an in silico set of oligonucleotides, the Human OligoExome, that covers the gene exons annotated by the Consensus Coding Sequence (CCDS) project. This resource is openly available as an Internet-accessible database from which one can download capture oligonucleotide sequences for any CCDS gene and design custom capture assays. Using this resource, we demonstrated the flexibility of this assay by custom designing capture assays ranging from 10 to over 100 gene targets, with total capture sizes from over 100 kilobases to nearly one megabase. We established a method to reduce capture variability and incorporated indexing schemes to increase sample throughput. Our approach has multiple applications, including but not limited to population-targeted resequencing studies of specific gene subsets, validation of variants discovered in whole genome sequencing surveys, and possible diagnostic analysis of disease gene subsets. We also present a cost analysis demonstrating its cost-effectiveness for large population studies.
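When designing a custom assay from a chosen gene subset, the total capture size can be estimated by merging overlapping exon intervals and summing their lengths. The sketch below assumes half-open (start, end) coordinates and a hypothetical in-memory mapping from gene name to exon intervals; it does not reflect the actual structure of the Human OligoExome database.

```python
def total_capture_size(exons_by_gene, genes):
    """Total number of bases covered by the exons of `genes`, counting overlaps once."""
    by_chrom = {}
    for gene in genes:
        for chrom, start, end in exons_by_gene[gene]:
            by_chrom.setdefault(chrom, []).append((start, end))
    total = 0
    for intervals in by_chrom.values():
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:
                # Overlapping or adjacent interval: extend the current block.
                cur_end = max(cur_end, end)
            else:
                total += cur_end - cur_start
                cur_start, cur_end = start, end
        total += cur_end - cur_start
    return total

# Hypothetical gene names and coordinates, for illustration only.
exons_by_gene = {
    "GENE_A": [("chr1", 100, 200), ("chr1", 150, 300)],
    "GENE_B": [("chr1", 500, 600), ("chr2", 100, 150)],
}
size_both = total_capture_size(exons_by_gene, ["GENE_A", "GENE_B"])
size_a = total_capture_size(exons_by_gene, ["GENE_A"])
```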
The advent of massively parallel sequencing technologies (Next Generation Sequencing, NGS) has profoundly modified the landscape of human genetics.
In particular, Whole Exome Sequencing (WES) is the NGS branch that focuses on the exonic regions of eukaryotic genomes; exomes are ideal for understanding high-penetrance allelic variation and its relationship to phenotype. A complete WES analysis involves several steps, which need to be suitably designed and arranged into an efficient pipeline.
Managing an NGS analysis pipeline and the huge amount of data it produces requires nontrivial IT skills and computational power.
Our web resource WEP (Whole-Exome sequencing Pipeline web tool) performs a complete WES pipeline and provides easy access, through a web interface, to intermediate and final results. The WEP pipeline is composed of several steps:
1) verification of input integrity and quality checks, read trimming and filtering; 2) gapped alignment; 3) BAM conversion, sorting and indexing; 4) duplicate removal; 5) alignment optimization around insertion/deletion (indel) positions; 6) recalibration of quality scores; 7) single nucleotide and deletion/insertion polymorphism (SNP and DIP) variant calling; 8) variant annotation; 9) result storage into custom databases to allow cross-linking and intersections, statistics and much more. In order to overcome the challenge of managing large amounts of data and to maximize the biological information extracted from them, our tool restricts the number of final results by filtering data with customizable thresholds, facilitating the identification of functionally significant variants. Default threshold values, tuned according to the most common literature published in recent years, are also provided when the analysis completes.
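The general shape of such a pipeline, an ordered list of steps whose intermediate results remain accessible, can be sketched as below. This is a generic illustration and not the WEP implementation: the runner and the two toy steps are hypothetical stand-ins for the real alignment, recalibration, and calling tools.

```python
def run_pipeline(steps, data):
    """Run (name, function) steps in order and keep every intermediate result."""
    results = {}
    for name, step in steps:
        data = step(data)
        results[name] = data
    return data, results

# Toy stand-ins for real pipeline stages (assumptions for the example).
def quality_filter(reads):
    """Drop reads containing an ambiguous base call."""
    return [read for read in reads if "N" not in read]

def remove_duplicates(reads):
    """Remove exact duplicate reads while preserving their order."""
    return list(dict.fromkeys(reads))

steps = [
    ("quality_filter", quality_filter),
    ("remove_duplicates", remove_duplicates),
]
final_reads, intermediate = run_pipeline(steps, ["ACGT", "ACNT", "ACGT", "TTGA"])
```

Storing each step's output under its name mirrors the idea of exposing intermediate as well as final results, at the cost of holding more data; a real pipeline would more likely record file paths than the data themselves.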
Through our tool, a user can perform the whole analysis without knowing the underlying hardware and software architecture, working with both paired-end and single-end data. The interface provides easy and intuitive data submission and user-friendly visualization of annotated variants.
Through WEP, users without IT expertise can access the most up-to-date and tested WES algorithms, tuned to maximize the quality of called variants while minimizing artifacts and false positives.
The web tool is available at the following web address: http://www.caspur.it/wep