Summary: Frozen robust multiarray analysis (fRMA) is a single-array preprocessing algorithm that retains the advantages of multiarray algorithms and removes certain batch effects by downweighting probes that have high between-batch residual variance. Here, we extend the fRMA algorithm to two new microarray platforms—Affymetrix Human Exon and Gene 1.0 ST—by modifying the fRMA probe-level model and extending the frma package to work with oligo ExonFeatureSet and GeneFeatureSet objects.
Availability and implementation: All packages are implemented in R. Source code and binaries are freely available through the Bioconductor project. Convenient links to all software and data packages can be found at http://mnmccall.com/software
For a better understanding of the biology of an organism, a complete description is needed of all regions of the genome that are actively transcribed. Tiling arrays are used for this purpose. They allow for the discovery of novel transcripts and the assessment of differential expression between two or more experimental conditions such as genotype, treatment, tissue, etc. In tiling array literature, many efforts are devoted to transcript discovery, whereas more recent developments also focus on differential expression. To our knowledge, however, no methods for tiling arrays have been described that can simultaneously assess transcript discovery and identify differentially expressed transcripts. In this paper, we adopt wavelet based functional models to the context of tiling arrays. The high dimensionality of the data triggered us to avoid inference based on Bayesian MCMC methods. Instead, we introduce a fast empirical Bayes method that provides adaptive regularization of the functional effects. A simulation study and a case study illustrate that our approach is well suited for the simultaneous assessment of transcript discovery and differential expression in tiling array studies, and that it outperforms methods that accomplish only one of these tasks.
tiling microarray; wavelets; adaptive regularization; transcript discovery; differential expression; genomics; Arabidopsis thaliana
Human cancers nearly ubiquitously harbor epigenetic alterations. While such alterations in epigenetic marks, including DNA methylation, are potentially heritable, they can also be dynamically altered. Given this potential for plasticity, the degree to which epigenetic changes can be subject to selection and act as drivers of neoplasia has been questioned. Here, we carried out genome-scale analyses of DNA methylation alterations in lethal metastatic prostate cancer and created DNA methylation “cityscape” plots to visualize these complex data. We show that somatic DNA methylation alterations, despite showing marked inter-individual heterogeneity among men with lethal metastatic prostate cancer, were maintained across all metastases within the same individual. The overall extent of maintenance in DNA methylation changes was comparable to that of genetic copy number alterations. Regions that were frequently hypermethylated across individuals were markedly enriched for cancer and development/differentiation related genes. Additionally, regions exhibiting high consistency of hypermethylation across metastases within individuals, even if variably hypermethylated across individuals, showed enrichment of cancer-related genes. Interestingly, whereas some regions showed intra-individual metastatic tumor heterogeneity in promoter methylation, such methylation alterations were generally not correlated with gene expression. This was despite a general tendency for promoter methylation patterns to be strongly correlated with gene expression, particularly at regions that were variably methylated across individuals. These findings suggest that DNA methylation alterations have the potential for producing selectable driver events in carcinogenesis and disease progression and highlight the possibility of targeting such epigenome alterations for development of longitudinal markers and therapeutic strategies.
In honeybee societies, distinct caste phenotypes are created from the same genotype, suggesting a role for epigenetics in deriving these behaviorally different phenotypes. We found no differences in DNA methylation between irreversible worker/queen castes, but substantial differences between nurses and forager subcastes. Reverting foragers back to nurses reestablished methylation levels for a majority of genes and provided the first evidence in any organism of reversible epigenetic changes associated with behavior.
Motivation: Although chromatin immunoprecipitation coupled with
high-throughput sequencing (ChIP-seq) or tiling array hybridization (ChIP-chip) is
increasingly used to map genome-wide–binding sites of transcription factors (TFs),
it still remains difficult to generate a quality ChIPx (i.e. ChIP-seq or ChIP-chip)
dataset because of the tremendous amount of effort required to develop effective
antibodies and efficient protocols. Moreover, most laboratories are unable to easily
obtain ChIPx data for one or more TF(s) in more than a handful of biological contexts.
Thus, standard ChIPx analyses primarily focus on analyzing data from one experiment, and
the discoveries are restricted to a specific biological context.
Results: We propose to enrich this existing data analysis paradigm by
developing a novel approach, ChIP-PED, which superimposes ChIPx data on large amounts of
publicly available human and mouse gene expression data containing a diverse collection of
cell types, tissues and disease conditions to discover new biological contexts with
potential TF regulatory activities. We demonstrate ChIP-PED using a number of examples,
including a novel discovery that MYC, a human TF, plays an important
functional role in pediatric Ewing sarcoma cell lines. These examples show that ChIP-PED
increases the value of ChIPx data by allowing one to expand the scope of possible
discoveries made from a ChIPx experiment.
Supplementary data are available at Bioinformatics
Background Gestational age at birth strongly predicts neonatal, adolescent and adult morbidity and mortality through mostly unknown mechanisms. Identification of specific genes that are undergoing regulatory change prior to birth, such as through changes in DNA methylation, would increase our understanding of developmental changes occurring during the third trimester and consequences of pre-term birth (PTB).
Methods We performed a genome-wide analysis of DNA methylation (using microarrays, specifically CHARM 2.0) in 141 newborns collected in Baltimore, MD, using novel statistical methodology to identify genomic regions associated with gestational age at birth. Bisulphite pyrosequencing was used to validate significant differentially methylated regions (DMRs), and real-time PCR was performed to assess functional significance of differential methylation in a subset of newborns.
Results We identified three DMRs at genome-wide significance levels adjacent to the NFIX, RAPGEF2 and MSRB3 genes. All three regions were validated by pyrosequencing, and RAGPEF2 also showed an inverse correlation between DNA methylation levels and gene expression levels. Although the three DMRs appear very dynamic with gestational age in our newborn sample, adult DNA methylation levels at these regions are stable and of equal or greater magnitude than the oldest neonate, directionally consistent with the gestational age results.
Conclusions We have identified three differentially methylated regions associated with gestational age at birth. All three nearby genes play important roles in the development of several organs, including skeletal muscle, brain and haematopoietic system. Therefore, they may provide initial insight into the basis of PTB's negative health outcomes. The genome-wide custom DNA methylation array technology and novel statistical methods employed in this study could constitute a model for epidemiologic studies of epigenetic variation.
Epigenetic epidemiology; differentially methylated regions; pre-term birth; gestational age; genome-wide DNA methylation
Background During the past 5 years, high-throughput technologies have been successfully used by epidemiology studies, but almost all have focused on sequence variation through genome-wide association studies (GWAS). Today, the study of other genomic events is becoming more common in large-scale epidemiological studies. Many of these, unlike the single-nucleotide polymorphism studied in GWAS, are continuous measures. In this context, the exercise of searching for regions of interest for disease is akin to the problems described in the statistical ‘bump hunting’ literature.
Methods New statistical challenges arise when the measurements are continuous rather than categorical, when they are measured with uncertainty, and when both biological signal, and measurement errors are characterized by spatial correlation along the genome. Perhaps the most challenging complication is that continuous genomic data from large studies are measured throughout long periods, making them susceptible to ‘batch effects’. An example that combines all three characteristics is genome-wide DNA methylation measurements. Here, we present a data analysis pipeline that effectively models measurement error, removes batch effects, detects regions of interest and attaches statistical uncertainty to identified regions.
Results We illustrate the usefulness of our approach by detecting genomic regions of DNA methylation associated with a continuous trait in a well-characterized population of newborns. Additionally, we show that addressing unexplained heterogeneity like batch effects reduces the number of false-positive regions.
Conclusions Our framework offers a comprehensive yet flexible approach for identifying genomic regions of biological interest in large epidemiological studies using quantitative high-throughput methods.
Epigenetic epidemiology; DNA methylation; genome-wide analysis; bump hunting; batch effects
It has recently been proposed that variation in DNA methylation at specific genomic locations may play an important role in the development of complex diseases such as cancer. Here, we develop 1- and 2-group multiple testing procedures for identifying and quantifying regions of DNA methylation variability. Our method is the first genome-wide statistical significance calculation for increased or differential variability, as opposed to the traditional approach of testing for mean changes. We apply these procedures to genome-wide methylation data obtained from biological and technical replicates and provide the first statistical proof that variably methylated regions exist and are due to interindividual variation. We also show that differentially variable regions in colon tumor and normal tissue show enrichment of genes regulating gene expression, cell morphogenesis, and development, supporting a biological role for DNA methylation variability in cancer.
Bump finding; Functional data analysis; Multiple testing; Preprocessing; Variably methylation regions (VMRs)
454 pyrosequencing is a commonly used massively parallel DNA sequencing technology with a wide variety of application fields such as epigenetics, metagenomics and transcriptomics. A well-known problem of this platform is its sensitivity to base-calling insertion and deletion errors, particularly in the presence of long homopolymers. In addition, the base-call quality scores are not informative with respect to whether an insertion or a deletion error is more likely. Surprisingly, not much effort has been devoted to the development of improved base-calling methods and more intuitive quality scores for this platform.
We present HPCall, a 454 base-calling method based on a weighted Hurdle Poisson model. HPCall uses a probabilistic framework to call the homopolymer lengths in the sequence by modeling well-known 454 noise predictors. Base-calling quality is assessed based on estimated probabilities for each homopolymer length, which are easily transformed to useful quality scores.
Using a reference data set of the Escherichia coli K-12 strain, we show that HPCall produces superior quality scores that are very informative towards possible insertion and deletion errors, while maintaining a base-calling accuracy that is better than the current one. Given the generality of the framework, HPCall has the potential to also adapt to other homopolymer-sensitive sequencing technologies.
Early screening for cancer is arguably one of the greatest public health advances over the last fifty years. However, many cancer screening tests are invasive (digital rectal exams), expensive (mammograms, imaging) or both (colonoscopies). This has spurred growing interest in developing genomic signatures that can be used for cancer diagnosis and prognosis. However, progress has been slowed by heterogeneity in cancer profiles and the lack of effective computational prediction tools for this type of data.
We developed anti-profiles as a first step towards translating experimental findings suggesting that stochastic across-sample hyper-variability in the expression of specific genes is a stable and general property of cancer into predictive and diagnostic signatures. Using single-chip microarray normalization and quality assessment methods, we developed an anti-profile for colon cancer in tissue biopsy samples. To demonstrate the translational potential of our findings, we applied the signature developed in the tissue samples, without any further retraining or normalization, to screen patients for colon cancer based on genomic measurements from peripheral blood in an independent study (AUC of 0.89). This method achieved higher accuracy than the signature underlying commercially available peripheral blood screening tests for colon cancer (AUC of 0.81). We also confirmed the existence of hyper-variable genes across a range of cancer types and found that a significant proportion of tissue-specific genes are hyper-variable in cancer. Based on these observations, we developed a universal cancer anti-profile that accurately distinguishes cancer from normal regardless of tissue type (ten-fold cross-validation AUC > 0.92).
We have introduced anti-profiles as a new approach for developing cancer genomic signatures that specifically takes advantage of gene expression heterogeneity. We have demonstrated that anti-profiles can be successfully applied to develop peripheral-blood based diagnostics for cancer and used anti-profiles to develop a highly accurate universal cancer signature. By using single-chip normalization and quality assessment methods, no further retraining of signatures developed by the anti-profile approach would be required before their application in clinical settings. Our results suggest that anti-profiles may be used to develop inexpensive and non-invasive universal cancer screening tests.
Gene expression; Cancer; Genomic signatures; Microarray normalization and quality assessment; Anti-profiles
DNA methylation is an important epigenetic modification involved in gene regulation, which can now be measured using whole-genome bisulfite sequencing. However, cost, complexity of the data, and lack of comprehensive analytical tools are major challenges that keep this technology from becoming widely applied. Here we present BSmooth, an alignment, quality control and analysis pipeline that provides accurate and precise results even with low coverage data, appropriately handling biological replicates. BSmooth is open source software, and can be downloaded from http://rafalab.jhsph.edu/bsmooth.
First identified as histone-modifying proteins, lysine acetyltranferases (KATs) and deacetylases (KDACs) antagonize each other through modification of the side chains of lysine residues in histone proteins1. (De)acetylation of many non-histone proteins involved in chromatin, metabolism or cytoskeleton regulation were further identified in eukaryotic organisms2–6, but the corresponding modifying enzymes and substrate-specific functions of the modification are unclear. Moreover, mechanisms underlying functional specificity of individual KDACs7 remain enigmatic, and the substrate spectra of each KDAC lack comprehensive definition. Here we dissect the functional specificity of twelve critical human KDACs using a genome-wide synthetic lethality screen8–13 in cultured human cells. The genetic interaction profiles revealed enzyme-substrate relationships between individual KDACs and many important substrates governing a wide array of biological processes including metabolism, development and cell cycle progression. We further confirmed that (de)acetylation of the catalytic subunit of the adenosine monophosphate-activated protein kinase (AMPK), a critical cellular energy-sensing protein kinase complex, is controlled by the opposing catalytic activities of HDAC1 and p300. Its deacetylation enhances physical interaction with the upstream kinase LKB1, in turn leading to AMPK phosphorylation and activation, resulting in lipid breakdown in human liver cells. These findings provide new insights into previously underappreciated metabolism-regulatory roles of HDAC1 in coordinating nutrient availability and cellular responses upstream of AMPK, and demonstrate the importance of high-throughput genetic interaction profiling to elucidate functional specificity and critical substrates of individual human KDACs potentially valuable for therapeutic applications.
In systems biology, the task of reverse engineering gene pathways from data has been limited not just by the curse of dimensionality (the interaction space is huge) but also by systematic error in the data. The gene expression barcode reduces spurious association driven by batch effects and probe effects. The binary nature of the resulting expression calls lends itself perfectly to modern regularization approaches that thrive in high-dimensional settings.
The Partitioned LASSO-Patternsearch algorithm is proposed to identify patterns of multiple dichotomous risk factors for outcomes of interest in genomic studies. A partitioning scheme is used to identify promising patterns by solving many LASSO-Patternsearch subproblems in parallel. All variables that survive this stage proceed to an aggregation stage where the most significant patterns are identified by solving a reduced LASSO-Patternsearch problem in just these variables. This approach was applied to genetic data sets with expression levels dichotomized by gene expression bar code. Most of the genes and second-order interactions thus selected and are known to be related to the outcomes.
We demonstrate with simulations and data analyses that the proposed method not only selects variables and patterns more accurately, but also provides smaller models with better prediction accuracy, in comparison to several alternative methodologies.
Genotyping platforms such as Affymetrix can be used to assess genotype-phenotype as well as copy number-phenotype associations at millions of markers. While genotyping algorithms are largely concordant when assessed on HapMap samples, tools to assess copy number changes are more variable and often discordant. One explanation for the discordance is that copy number estimates are susceptible to systematic differences between groups of samples that were processed at different times or by different labs. Analysis algorithms that do not adjust for batch effects are prone to spurious measures of association. The R package crlmm implements a multilevel model that adjusts for batch effects and provides allele-specific estimates of copy number. This paper illustrates a workflow for the estimation of allele-specific copy number and integration of the marker-level estimates with complimentary Bioconductor software for inferring regions of copy number gain or loss. All analyses are performed in the statistical environment R.
copy number; batch effects; robust; multilevel model; high-throughput; oligonucleotide array
Motivation: Changes in the copy number of chromosomal DNA segments [copy number variants (CNVs)] have been implicated in human variation, heritable diseases and cancers. Microarray-based platforms are the current established technology of choice for studies reporting these discoveries and constitute the benchmark against which emergent sequence-based approaches will be evaluated. Research that depends on CNV analysis is rapidly increasing, and systematic platform assessments that distinguish strengths and weaknesses are needed to guide informed choice.
Results: We evaluated the sensitivity and specificity of six platforms, provided by four leading vendors, using a spike-in experiment. NimbleGen and Agilent platforms outperformed Illumina and Affymetrix in accuracy and precision of copy number dosage estimates. However, Illumina and Affymetrix algorithms that leverage single nucleotide polymorphism (SNP) information make up for this disadvantage and perform well at variant detection. Overall, the NimbleGen 2.1M platform outperformed others, but only with the use of an alternative data analysis pipeline to the one offered by the manufacturer.
Availability: The data is available from http://rafalab.jhsph.edu/cnvcomp/.
Contact: email@example.com; firstname.lastname@example.org; email@example.com
Supplementary information: Supplementary data are available at Bioinformatics online.
While genome-wide association studies are ongoing to identify sequence variation influencing susceptibility to major depressive disorder (MDD), epigenetic marks, such as DNA methylation, which can be influenced by environment, might also play a role. Here we present the first genome-wide DNA methylation (DNAm) scan in MDD. We compared 39 postmortem frontal cortex MDD samples to 26 controls. DNA was hybridized to our Comprehensive High-throughput Arrays for Relative Methylation (CHARM) platform, covering 3.5 million CpGs. CHARM identified 224 candidate regions with DNAm differences >10%. These regions are highly enriched for neuronal growth and development genes. Ten of 17 regions for which validation was attempted showed true DNAm differences; the greatest were in PRIMA1, with 12–15% increased DNAm in MDD (p = 0.0002–0.0003), and a concomitant decrease in gene expression. These results must be considered pilot data, however, as we could only test replication in a small number of additional brain samples (n = 16), which showed no significant difference in PRIMA1. Because PRIMA1 anchors acetylcholinesterase in neuronal membranes, decreased expression could result in decreased enzyme function and increased cholinergic transmission, consistent with a role in MDD. We observed decreased immunoreactivity for acetylcholinesterase in MDD brain with increased PRIMA1 DNAm, non-significant at p = 0.08.
While we cannot draw firm conclusions about PRIMA1 DNAm in MDD, the involvement of neuronal development genes across the set showing differential methylation suggests a role for epigenetics in the illness. Further studies using limbic system brain regions might shed additional light on this role.
DNA methylation is a key regulator of gene function in a multitude of both normal and abnormal biological processes, but tools to elucidate its roles on a genome-wide scale are still in their infancy. Methylation sensitive restriction enzymes and microarrays provide a potential high-throughput, low-cost platform to allow methylation profiling. However, accurate absolute methylation estimates have been elusive due to systematic errors and unwanted variability. Previous microarray preprocessing procedures, mostly developed for expression arrays, fail to adequately normalize methylation-related data since they rely on key assumptions that are violated in the case of DNA methylation. We develop a normalization strategy tailored to DNA methylation data and an empirical Bayes percentage methylation estimator that together yield accurate absolute methylation estimates that can be compared across samples. We illustrate the method on data generated to detect methylation differences between tissues and between normal and tumor colon samples.
DNA methylation; Epigenetics; Microarray
Tumor heterogeneity is a major barrier to effective cancer diagnosis and treatment. We recently identified cancer-specific differentially DNA-methylated regions (cDMRs) in colon cancer, which also distinguish normal tissue types from each other, suggesting that these cDMRs might be generalized across cancer types. Here we show stochastic methylation variation of the same cDMRs, distinguishing cancer from normal, in colon, lung, breast, thyroid, and Wilms tumors, with intermediate variation in adenomas. Whole genome bisulfite sequencing shows these variable cDMRs are related to loss of sharply delimited methylation boundaries at CpG islands. Furthermore, we find hypomethylation of discrete blocks encompassing half the genome, with extreme gene expression variability. Genes associated with the cDMRs and large blocks are involved in mitosis and matrix remodeling, respectively. These data suggest a model for cancer involving loss of epigenetic stability of well-defined genomic domains that underlies increased methylation variability in cancer and could contribute to tumor heterogeneity.
The ability to measure gene expression on a genome-wide scale is one of the most promising accomplishments in molecular biology. Microarrays, the technology that first permitted this, were riddled with problems due to unwanted sources of variability. Many of these problems are now mitigated, after a decade's worth of statistical methodology development. The recently developed RNA sequencing (RNA-seq) technology has generated much excitement in part due to claims of reduced variability in comparison to microarrays. However, we show that RNA-seq data demonstrate unwanted and obscuring variability similar to what was first observed in microarrays. In particular, we find guanine-cytosine content (GC-content) has a strong sample-specific effect on gene expression measurements that, if left uncorrected, leads to false positives in downstream results. We also report on commonly observed data distortions that demonstrate the need for data normalization. Here, we describe a statistical methodology that improves precision by 42% without loss of accuracy. Our resulting conditional quantile normalization algorithm combines robust generalized regression to remove systematic bias introduced by deterministic features such as GC-content and quantile normalization to correct for global distortions.
Gene expression; Normalization; RNA sequencing
RNA sequencing has generated much excitement for the advantages offered over microarrays. This excitement has led to a barrage of publications discounting the importance of biological variability; as microarray publications did in the 1990s. By comparing microarray and sequencing data, we demonstrate that expression measurements exhibit biological variability across individuals irrespective of measurement technology. Our analysis suggests RNA-sequencing experiments designed to estimate biological variability are more likely to produce reproducible results.
Submicroscopic changes in chromosomal DNA copy number dosage are common and have been implicated in many heritable diseases and cancers. Recent high-throughput technologies have a resolution that permits the detection of segmental changes in DNA copy number that span thousands of base pairs in the genome. Genomewide association studies (GWAS) may simultaneously screen for copy number phenotype and single nucleotide polymorphism (SNP) phenotype associations as part of the analytic strategy. However, genomewide array analyses are particularly susceptible to batch effects as the logistics of preparing DNA and processing thousands of arrays often involves multiple laboratories and technicians, or changes over calendar time to the reagents and laboratory equipment. Failure to adjust for batch effects can lead to incorrect inference and requires inefficient post hoc quality control procedures to exclude regions that are associated with batch. Our work extends previous model-based approaches for copy number estimation by explicitly modeling batch and using shrinkage to improve locus-specific estimates of copy number uncertainty. Key features of this approach include the use of biallelic genotype calls from experimental data to estimate batch-specific and locus-specific parameters of background and signal without the requirement of training data. We illustrate these ideas using a study of bipolar disease and a study of chromosome 21 trisomy. The former has batch effects that dominate much of the observed variation in the quantile-normalized intensities, while the latter illustrates the robustness of our approach to a data set in which approximately 27% of the samples have altered copy number. Locus-specific estimates of copy number can be plotted on the copy number scale to investigate mosaicism and guide the choice of appropriate downstream approaches for smoothing the copy number as a function of physical position. The software is open source and implemented in the R package crlmm at Bioconductor (http:www.bioconductor.org).
Bioinformatics; Hierarchical models; DNA copy number variations; Single nucleotide polymorphism array
Considerable time and effort has been spent in developing analysis and quality assessment methods to allow the use of microarrays in a clinical setting. As is the case for microarrays and other high-throughput technologies, data from new high-throughput sequencing technologies are subject to technological and biological biases and systematic errors that can impact downstream analyses. Only when these issues can be readily identified and reliably adjusted for will clinical applications of these new technologies be feasible. Although much work remains to be done in this area, we describe consistently observed biases that should be taken into account when analyzing high-throughput sequencing data. In this article, we review current knowledge about these biases, discuss their impact on analysis results, and propose solutions.
Human immunodeficiency virus type 1 (HIV-1) establishes a latent reservoir in resting memory CD4+ T cells. This latent reservoir is a major barrier to the eradication of HIV-1 in infected individuals and is not affected by highly active antiretroviral therapy (HAART). Reactivation of latent HIV-1 is a possible strategy for elimination of this reservoir. The mechanisms with which latency is maintained are unclear. In the analysis of the regulation of HIV-1 gene expression, it is important to consider the nature of HIV-1 integration sites. In this study, we analyzed the integration and transcription of latent HIV-1 in a primary CD4+ T cell model of latency. The majority of integration sites in latently infected cells were in introns of transcription units. Serial analysis of gene expression (SAGE) demonstrated that more than 90% of those host genes harboring a latent integrated provirus were transcriptionally active, mostly at high levels. For latently infected cells, we observed a modest preference for integration in the same transcriptional orientation as the host gene (63.8% versus 36.2%). In contrast, this orientation preference was not observed in acutely infected or persistently infected cells. These results suggest that transcriptional interference may be one of the important factors in the establishment and maintenance of HIV-1 latency. Our findings suggest that disrupting the negative control of HIV-1 transcription by upstream host promoters could facilitate the reactivation of latent HIV-1 in some resting CD4+ T cells.
Motivation: The availability of flexible open source software for the analysis of gene expression raw level data has greatly facilitated the development of widely used preprocessing methods for these technologies. However, the expansion of microarray applications has exposed the limitation of existing tools.
Results: We developed the oligo package to provide a more general solution that supports a wide range of applications. The package is based on the BioConductor principles of transparency, reproducibility and efficiency of development. It extends the existing tools and leverages existing code for visualization, accessing data and widely used preprocessing routines. The oligo package implements a unified paradigm for preprocessing data and interfaces with other BioConductor tools for downstream analysis. Our infrastructure is general and can be used by other BioConductor packages.
Availability: The oligo package is freely available through BioConductor, http://www.bioconductor.org.
Contact: firstname.lastname@example.org; email@example.com
Supplementary information: Supplementary data are available at Bioinformatics online.
Sequencing targeted DNA regions in large samples is necessary to discover the full spectrum of rare variants. We report an effective Illumina sequencing strategy utilizing pooled samples with novel quality (Srfim) and filtering (SERVIC4E) algorithms. We sequenced 24 exons in two cohorts of 480 samples each, identifying 47 coding variants, including 30 present once per cohort. Validation by Sanger sequencing revealed an excellent combination of sensitivity and specificity for variant detection in pooled samples of both cohorts as compared to publicly available algorithms.