Several studies using test-day models show clear heterogeneity of residual variance along lactation. A changepoint technique to account for this heterogeneity is proposed. The data set included 100 744 test-day records of 10 869 Holstein-Friesian cows from northern Spain. A three-stage hierarchical model using the Wood lactation function was employed. The model included two unknown changepoints at times T1 and T2.
changepoint; heterogeneity; residual variance
Motivation: The complexity of a large number of recently discovered copy number polymorphisms is much higher than initially thought, making them more difficult to detect in the presence of significant measurement noise. In this scenario, performing normalization and segmentation separately is prone to producing many false detections of changes in copy number. New approaches capable of jointly modeling the copy number and the non-copy number (noise) hybridization effects across multiple samples will potentially lead to more accurate results.
Methods: In this article, the genome alteration detection analysis (GADA) approach introduced in our previous work is extended to a multiple sample model. The copy number component is independent for each sample and uses a sparse Bayesian prior, while the reference hybridization level is not necessarily sparse but identical on all samples. The expectation maximization (EM) algorithm used to fit the model iteratively determines whether the observed hybridization levels are more likely due to a copy number variation or to a shared hybridization bias.
Results: The new proposed approach is compared with the currently used strategy of separate normalization followed by independent segmentation of each array. Real microarray data obtained from HapMap samples are randomly partitioned to create different reference sets. Using the new approach, copy number and reference intensity estimates are significantly less variable when the reference set changes, and copy numbers detected within HapMap family trios are more consistent. Finally, the running time to fit the model grows linearly with the number of samples and probes.
Contact: email@example.com; firstname.lastname@example.org
Supplementary information: Supplementary data are available at Bioinformatics online.
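The alternating estimation idea behind this joint model can be illustrated with a deliberately simplified sketch (not the authors' sparse Bayesian EM): a shared per-probe reference level is re-estimated robustly, and each sample's residual is soft-thresholded to keep the copy-number component sparse. The function name, the median step and the threshold value are illustrative assumptions.

```python
import numpy as np

def joint_decompose(Y, thr=0.5, n_iter=20):
    """Alternately estimate a per-probe reference level shared by all
    samples and a sparse per-sample copy-number component from a
    log-ratio matrix Y (samples x probes)."""
    C = np.zeros_like(Y)                      # sparse copy-number part
    for _ in range(n_iter):
        r = np.median(Y - C, axis=0)          # shared hybridization level
        residual = Y - r
        # soft-threshold the residual to keep the copy-number part sparse
        C = np.sign(residual) * np.maximum(np.abs(residual) - thr, 0.0)
    return r, C

# toy data: 4 samples, 50 probes, a shared probe bias, one gain in sample 0
rng = np.random.default_rng(0)
bias = rng.normal(0.0, 0.2, 50)
Y = bias + rng.normal(0.0, 0.05, (4, 50))
Y[0, 10:20] += 1.0
r_hat, C_hat = joint_decompose(Y)
```

Because the reference level is shared across samples, a bias common to all arrays is absorbed into r rather than called as a copy-number change in every sample.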
DNA copy number variation (CNV) accounts for a large proportion of genetic variation. One commonly used approach to detecting CNVs is array-based comparative genomic hybridization (aCGH). Although many methods have been proposed to analyze aCGH data, it is not clear how to combine information from multiple samples to improve CNV detection. In this paper, we propose to use a matrix to approximate the multisample aCGH data and minimize the total variation of each sample as well as the nuclear norm of the whole matrix. In this way, we can make use of the smoothness property of each sample and the correlation among multiple samples simultaneously in a convex optimization framework. We also developed an efficient and scalable algorithm to handle large-scale data. Experiments demonstrate that the proposed method outperforms the state-of-the-art techniques under a wide range of scenarios and it is capable of processing large data sets with millions of probes.
CNV; aCGH; total variation; spectral regularization; convex optimization
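The nuclear-norm part of such a convex program is typically handled with singular value thresholding, the proximal operator of the nuclear norm. A minimal sketch follows; the paper's full objective also includes per-sample total-variation terms, omitted here.

```python
import numpy as np

def svt(X, tau):
    """Singular value thresholding: the proximal operator of the nuclear
    norm, which shrinks the matrix toward low rank."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

# a rank-1 matrix plus noise: thresholding removes the noise directions
rng = np.random.default_rng(1)
A = np.outer(np.ones(5), np.ones(8))      # rank 1, top singular value ~6.3
X = A + rng.normal(0.0, 0.05, A.shape)
X_low = svt(X, 1.0)
```

Small singular values contributed by noise fall below tau and are set exactly to zero, so the output is genuinely low rank rather than merely close to it.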
High-density SNP microarrays provide a useful tool for the detection of copy number variants (CNVs). The analysis of such large amounts of data is complicated, especially with regard to determining where copy numbers change and their corresponding values. In this paper, we propose a Bayesian multiple change point model (BMCP) for segmentation and estimation of SNP microarray data. Segmentation concerns separating a chromosome into regions of equal copy number differences between the sample of interest and some reference, and involves the detection of locations of copy number difference changes. Estimation concerns determining true copy number for each segment. Our approach not only gives posterior estimates for the parameters of interest, namely locations for copy number difference changes and true copy number estimates, but also useful confidence measures. In addition, our algorithm can segment multiple samples simultaneously, and infer both common and rare CNVs across individuals. Finally, for studies of CNVs in tumors, we incorporate an adjustment factor for signal attenuation due to tumor heterogeneity or normal contamination that can improve copy number estimates.
Bayesian multiple change points; copy number variant; estimation; segmentation; signal attenuation; SNP microarrays
Motivated by DNA copy number variation (CNV) analysis based on high-density single nucleotide polymorphism (SNP) data, we consider the problem of detecting and identifying sparse short segments in a long one-dimensional sequence of data with additive Gaussian white noise, where the number, length and location of the segments are unknown. We present a statistical characterization of the identifiable region of a segment where it is possible to reliably separate the segment from noise. An efficient likelihood ratio selection (LRS) procedure for identifying the segments is developed, and the asymptotic optimality of this method is presented in the sense that the LRS can separate the signal segments from the noise as long as the signal segments are in the identifiable regions. The proposed method is demonstrated with simulations and analysis of a real data set on identification of copy number variants based on high-density SNP data. The results show that the LRS procedure can yield greater gain in power for detecting the true segments than some standard signal identification methods.
Likelihood ratio selection; signal detection; multiple testing; DNA copy number
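The core of a likelihood-ratio segment scan can be sketched as follows, assuming unit-variance Gaussian noise and a bound on segment length; the exhaustive double loop is for illustration and is not the paper's optimized LRS procedure.

```python
import numpy as np

def lr_segment_scan(y, max_len=30):
    """Scan all short intervals for the maximum likelihood-ratio statistic
    of a nonzero-mean segment in standard Gaussian noise:
    LR(i, j) = (sum of y over [i, j])^2 / (j - i + 1)."""
    cs = np.concatenate(([0.0], np.cumsum(y)))
    best, best_ij = -np.inf, (0, 0)
    for i in range(len(y)):
        for j in range(i, min(i + max_len, len(y))):
            s = cs[j + 1] - cs[i]
            stat = s * s / (j - i + 1)
            if stat > best:
                best, best_ij = stat, (i, j)
    return best_ij, best

rng = np.random.default_rng(2)
y = rng.normal(0.0, 1.0, 200)
y[80:90] += 2.0                 # short hidden segment
(i, j), stat = lr_segment_scan(y)
```

The cumulative-sum trick makes each interval statistic O(1), so the whole scan costs O(n * max_len).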
Cancer progression is associated with genomic instability and an accumulation of gains and losses of DNA. The growing variety of tools for measuring genomic copy numbers, including various types of array-CGH, SNP arrays and high-throughput sequencing, calls for a coherent framework offering unified and consistent handling of single- and multi-track segmentation problems. In addition, there is a demand for highly computationally efficient segmentation algorithms, due to the emergence of very high density scans of copy number.
A comprehensive Bioconductor package for copy number analysis is presented. The package offers a unified framework for single sample, multi-sample and multi-track segmentation and is based on statistically sound penalized least squares principles. Conditional on the number of breakpoints, the estimates are optimal in the least squares sense. A novel and computationally highly efficient algorithm is proposed that utilizes vector-based operations in R. Three case studies are presented.
The R package copynumber is a software suite for segmentation of single- and multi-track copy number data using algorithms based on coherent least squares principles.
Copy number; aCGH; Segmentation; Allele-specific segmentation; Penalized regression; Least squares; Bioconductor
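Conditional on the number of breakpoints, optimal least-squares segmentation can be computed exactly by dynamic programming. A compact O(k n^2) sketch of that principle follows; it is not the package's vectorized algorithm.

```python
import numpy as np

def dp_segment(y, k):
    """Exact least-squares segmentation of y into k + 1 constant pieces
    by dynamic programming; conditional on the number of breakpoints the
    fit is optimal in the least-squares sense."""
    n = len(y)
    cs = np.concatenate(([0.0], np.cumsum(y)))
    cs2 = np.concatenate(([0.0], np.cumsum(y * y)))

    def cost(i, j):                     # SSE of a constant fit on y[i:j]
        s = cs[j] - cs[i]
        return (cs2[j] - cs2[i]) - s * s / (j - i)

    D = np.full((k + 2, n + 1), np.inf)
    back = np.zeros((k + 2, n + 1), dtype=int)
    D[0, 0] = 0.0
    for m in range(1, k + 2):           # m = number of segments used
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                c = D[m - 1, i] + cost(i, j)
                if c < D[m, j]:
                    D[m, j], back[m, j] = c, i
    bps, j = [], n                      # backtrack the breakpoints
    for m in range(k + 1, 0, -1):
        j = back[m, j]
        bps.append(int(j))
    return sorted(bps)[1:]              # drop the leading 0

rng = np.random.default_rng(3)
y = np.concatenate([np.zeros(20), np.full(20, 5.0), np.zeros(20)])
y += rng.normal(0.0, 0.01, 60)
```

On this two-jump example the recursion recovers the breakpoints exactly.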
Motivation: Human genomic variability occurs at different scales, from single nucleotide polymorphisms (SNPs) to large DNA segments. Copy number variations (CNVs) represent a significant part of our genetic heterogeneity and have also been associated with many diseases and disorders. Short, localized CNVs, which may play an important role in human disease, may be undetectable in noisy genomic data. Therefore, robust methodologies are needed for their detection. Furthermore, for meaningful identification of pathological CNVs, estimation of normal allelic aberrations is necessary.
Results: We developed a signal processing-based methodology for sequence denoising followed by pattern matching, to increase SNR in genomic data and improve CNV detection. We applied this signal-decomposition-matched filtering (SDMF) methodology to 429 normal genomic sequences, and compared detected CNVs to those in the Database of Genomic Variants. SDMF successfully detected a significant number of previously identified CNVs with frequencies of occurrence ≥10%, as well as unreported short CNVs. Its performance was also compared to circular binary segmentation (CBS) through simulations. SDMF had a significantly lower false detection rate and was significantly faster than CBS, an important advantage for handling large datasets generated with high-resolution arrays. By focusing on improving SNR (instead of the robustness of the detection algorithm), SDMF is a very promising methodology for identifying CNVs at all genomic spatial scales.
Availability: The data are available at http://tcga-data.nci.nih.gov/tcga/. The software and list of analyzed sequence IDs are available at http://www.hsph.harvard.edu/~betensky/. A Matlab code for Empirical Mode Decomposition may be found at http://www.clear.rice.edu/elec301/Projects02/empiricalMode/code.html
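The matched-filtering step of such a pipeline can be sketched with a unit-energy boxcar template; the template length and the use of np.convolve are illustrative assumptions, not SDMF's actual filters.

```python
import numpy as np

def matched_filter_scan(y, L):
    """Matched-filter sketch: correlate the signal with a unit-energy
    boxcar of length L; a peak in the filter output locates a segment of
    altered copy number."""
    template = np.ones(L) / np.sqrt(L)
    out = np.convolve(y, template, mode="same")
    peak = int(np.argmax(np.abs(out)))
    return out, peak

rng = np.random.default_rng(2)
y = rng.normal(0.0, 1.0, 300)
y[120:140] += 1.5                     # short CNV-like shift
out, peak = matched_filter_scan(y, 20)
```

When the template length matches the true segment length, the filter output at the segment center grows like sqrt(L) times the shift, which is why improving SNR before detection pays off for short CNVs.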
Multisample, nonindexed pooling combined with next-generation sequencing (NGS) was used to discover RET proto-oncogene sequence variation within a cohort known to be unaffected by multiple endocrine neoplasia type 2 (MEN2). DNA samples (113 Caucasians, 23 persons of other ethnicities) were amplified for RET intron 9 to intron 16 and then divided into 5 pools of <30 samples each before library prep and NGS. Two controls were included in this study, a single sample and a pool of 50 samples that had been previously sequenced by the same NGS methods. All 59 variants previously detected in the 50-pool control were present. Of the 61 variants detected in the unaffected cohort, 20 variants were novel changes. Several variants were validated by high-resolution melting analysis and Sanger sequencing, and their allelic frequencies correlated well with those determined by NGS. The results from this unaffected cohort will be added to the RET MEN2 database.
The standard approach of using tumor doubling time to assess growth delay may not accurately represent tumor response, especially if the growth rates are not constant. Therefore, we developed a method to compare the antitumor activities of different treatments in xenograft experiments that uses the entire growth curve to estimate non-constant growth rates.
A Bayesian hierarchical changepoint (BHC) method was used to model logarithmically transformed tumor volumes. Each tumor was assumed to have a growth profile, represented by a pre-nadir regression rate, a regression period, a nadir volume, and a post-nadir regrowth rate. Confidence intervals were calculated to compare these features between different treatments. We used data from a study assessing the effects of radiation, gemcitabine, and a Chk1/2 inhibitor on MiaPaCa-2 xenografts.
We found that the BHC model provided a good fit to the data and more descriptive features than the tumor doubling approach. This model detected significant tumor regression in the AZD7762+1Gy and GEM+1Gy groups that was not detected when comparing the tumor doubling times. The BHC model also provided evidence that the growth inhibition resulted from a direct tumor effect rather than an indirect effect on the tumor bed, as evidenced by dramatic tumor regression in response to effective treatments and similar post-nadir regrowth rates across all treatment groups.
Compared with the tumor doubling time approach, the BHC model utilizes all of the data, providing more descriptive features that address the mechanisms underlying tumor growth inhibition and maximizing the biological information obtained from tumor xenograft studies.
Xenograft model; changepoint model; tumor doubling time; growth delay; tumor growth profile; Bayesian model
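A non-Bayesian sketch of the changepoint growth profile (grid search plus least squares rather than the hierarchical Bayesian fit) illustrates the pre-nadir rate / nadir / regrowth-rate parameterization; the rates and nadir time below are synthetic.

```python
import numpy as np

def fit_growth_changepoint(t, logv):
    """Least-squares sketch of a changepoint growth profile on log tumor
    volumes: regression at one rate until a nadir time tau, then regrowth
    at another rate, continuous at tau; tau is chosen by grid search."""
    best = None
    for tau in t[1:-1]:
        X = np.column_stack([np.ones_like(t),
                             np.minimum(t, tau),         # pre-nadir ramp
                             np.maximum(t - tau, 0.0)])  # post-nadir ramp
        beta = np.linalg.lstsq(X, logv, rcond=None)[0]
        sse = ((logv - X @ beta) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, float(tau), beta)
    return best[1], best[2]   # nadir time, [level, pre-rate, post-rate]

# synthetic log-volume curve: regression at -0.3/day, nadir at day 8,
# regrowth at +0.2/day
rng = np.random.default_rng(1)
t = np.arange(0.0, 21.0)
logv = 5.0 - 0.3 * np.minimum(t, 8.0) + 0.2 * np.maximum(t - 8.0, 0.0)
logv += rng.normal(0.0, 0.01, t.size)
tau_hat, beta_hat = fit_growth_changepoint(t, logv)
```

Unlike a single doubling-time summary, the fitted triple (pre-nadir rate, nadir time, regrowth rate) uses the whole curve, which is the point the BHC model makes with full posterior inference.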
Comparative genomic hybridization (CGH) microarrays have been used to determine copy number variations (CNVs) and their effects on complex diseases. Detection of absolute CNVs independent of genomic variants of an arbitrary reference sample has been a critical issue in CGH array experiments. Whole genome analysis using massively parallel sequencing with multiple ultra-high resolution CGH arrays provides an opportunity to catalog highly accurate genomic variants of the reference DNA (NA10851). Using information on variants, we developed a new method, the CGH array reference-free algorithm (CARA), which can determine reference-unbiased absolute CNVs from any CGH array platform. The algorithm enables the removal and rescue of false positive and false negative CNVs, respectively, which appear due to the effects of genomic variants of the reference sample in raw CGH array experiments. We found that the CARA remarkably enhanced the accuracy of CGH array in determining absolute CNVs. Our method thus provides a new approach to interpret CGH array data for personalized medicine.
Copy number variants (CNVs) are important genetic factors for studying human diseases. While high-throughput whole-genome re-sequencing provides multiple lines of evidence for detecting CNVs, computational algorithms need to be tailored to different types and sizes of CNVs under different experimental designs.
To achieve optimal power and resolution of detecting CNVs at low depth of coverage, we implemented a Hidden Markov Model that integrates both depth of coverage and mate-pair relationship. The novelty of our algorithm is that we infer the likelihood of carrying a deletion jointly from multiple mate pairs in a region, without requiring any single mate pair to be an obvious outlier. By integrating all useful information in a comprehensive model, our method is able to detect medium-size deletions (200-2000 bp) at low depth (<10× per sample). We applied the method to simulated data and demonstrate that the power to detect medium-size deletions is close to theoretical values.
A program implemented in Java, Zinfandel, is available at http://www.cs.columbia.edu/~itsik/zinfandel/
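A depth-only sketch of the deletion HMM (mate-pair evidence omitted) illustrates the Viterbi decoding; the Poisson emission model and the switching probability are illustrative assumptions, not Zinfandel's actual parameterization.

```python
import math
import numpy as np

def viterbi_depth(depth, lam_normal, p_switch=1e-3):
    """Viterbi decoding sketch for a two-state HMM (0 = normal copy,
    1 = heterozygous deletion) with Poisson read-depth emissions; a
    deletion halves the expected depth."""
    lams = (lam_normal, lam_normal / 2.0)
    log_T = np.log([[1 - p_switch, p_switch],
                    [p_switch, 1 - p_switch]])

    def emit(d):                        # Poisson log-likelihood per state
        return np.array([d * math.log(l) - l - math.lgamma(d + 1)
                         for l in lams])

    n = len(depth)
    V = np.zeros((n, 2))
    back = np.zeros((n, 2), dtype=int)
    V[0] = np.log([0.99, 0.01]) + emit(depth[0])
    for t in range(1, n):
        for s in range(2):
            cand = V[t - 1] + log_T[:, s]
            back[t, s] = int(np.argmax(cand))
            V[t, s] = cand[back[t, s]] + emit(depth[t])[s]
    path = np.zeros(n, dtype=int)
    path[-1] = int(np.argmax(V[-1]))
    for t in range(n - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path

# simulated depth: normal coverage 10x, a deletion over bins 40-59
rng = np.random.default_rng(4)
depth = rng.poisson(10.0, 100)
depth[40:60] = rng.poisson(5.0, 20)
path = viterbi_depth(depth, 10.0)
```

The transition penalty is what pools evidence across bins: no single low-depth bin triggers a call, but a run of them does, which mirrors the paper's point about not requiring individual outliers.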
Accurate information on haplotypes and diplotypes (haplotype pairs) is required for population-genetic analyses; however, microarrays do not provide data on a haplotype or diplotype at a copy number variation (CNV) locus; they only provide data on the total number of copies over a diplotype or an unphased sequence genotype (e.g., AAB, unlike AB of single nucleotide polymorphism). Moreover, such copy numbers or genotypes are often incorrectly determined when microarray signal intensities derived from different copy numbers or genotypes are not clearly separated due to noise. Here we report an algorithm to infer CNV haplotypes and individuals’ diplotypes at multiple loci from noisy microarray data, utilizing the probability that a signal intensity may be derived from different underlying copy numbers or genotypes. Performing simulation studies based on known diplotypes and an error model obtained from real microarray data, we demonstrate that this probabilistic approach succeeds in accurate inference (error rate: 1–2%) from noisy data, whereas previous deterministic approaches failed (error rate: 12–18%). Applying this algorithm to real microarray data, we estimated haplotype frequencies and diplotypes in 1486 CNV regions for 100 individuals. Our algorithm will facilitate accurate population-genetic analyses and powerful disease association studies of CNVs.
copy number variation; EM algorithm; haplotype inference; phasing
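For the simplified case of exactly observed diplotype totals at a single locus, the EM (gene-counting) idea can be sketched as follows; the real algorithm additionally models noisy signal intensities and multiple loci.

```python
import numpy as np

def cnv_hap_em(totals, max_copy=2, n_iter=200):
    """Gene-counting EM sketch: estimate the population frequencies of
    per-haplotype copy numbers 0..max_copy from observed diplotype totals
    (the sum over the two haplotypes), assuming random mating."""
    f = np.ones(max_copy + 1) / (max_copy + 1)
    vals, ns = np.unique(np.asarray(totals), return_counts=True)
    for _ in range(n_iter):
        counts = np.zeros_like(f)
        for t, n in zip(vals, ns):
            pairs = [(a, t - a) for a in range(max_copy + 1)
                     if 0 <= t - a <= max_copy]
            w = np.array([f[a] * f[b] for a, b in pairs])
            w /= w.sum()
            for (a, b), wi in zip(pairs, w):   # expected haplotype counts
                counts[a] += n * wi
                counts[b] += n * wi
        f = counts / counts.sum()
    return f

# simulate diplotypes under the true haplotype copy-number frequencies
rng = np.random.default_rng(5)
f_true = np.array([0.5, 0.3, 0.2])
haps = rng.choice(3, size=(5000, 2), p=f_true)
f_hat = cnv_hap_em(haps.sum(axis=1))
```

A total of 2 is ambiguous (0+2, 1+1 or 2+0); the E-step splits such individuals across the compatible haplotype pairs in proportion to the current frequency estimates.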
Motivation: Next-generation sequencing presents several statistical challenges, with one of the most fundamental being determining an individual's genotype from multiple aligned short read sequences at a position. Some simple approaches for genotype calling apply fixed filters, such as calling a heterozygote if more than a specified percentage of the reads have variant nucleotide calls. Other genotype-calling methods, such as MAQ and SOAPsnp, are implementations of Bayes classifiers in that they classify genotypes using posterior genotype probabilities.
Results: Here, we propose a novel genotype-calling algorithm that, in contrast to the other methods, estimates parameters underlying the posterior probabilities in an adaptive way rather than arbitrarily specifying them a priori. The algorithm, which we call SeqEM, applies the well-known Expectation-Maximization algorithm to an appropriate likelihood for a sample of unrelated individuals with next-generation sequence data, leveraging information from the sample to estimate genotype probabilities and the nucleotide-read error rate. We demonstrate using analytic calculations and simulations that SeqEM results in genotype-call error rates as small as or smaller than filtering approaches and MAQ. We also apply SeqEM to exome sequence data in eight related individuals and compare the results to genotypes from an Illumina SNP array, showing that SeqEM behaves well in real data that deviates from idealized assumptions.
Conclusion: SeqEM offers an improved, robust and flexible genotype-calling approach that can be widely applied in next-generation sequencing studies.
Availability and implementation: Software for SeqEM is freely available from our website: www.hihg.org under Software Download.
Supplementary information: Supplementary data are available at Bioinformatics online.
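The EM idea behind SeqEM can be sketched for a single site: genotype frequencies and the read error rate are estimated jointly from the variant-read counts of unrelated individuals. This is a simplified illustration under binomial read sampling, not the released SeqEM implementation.

```python
import math
import numpy as np

def seqem(var_reads, depths, n_iter=50):
    """EM sketch in the spirit of SeqEM: jointly estimate genotype
    frequencies and a nucleotide-read error rate e.  Genotypes RR/RA/AA
    emit a variant read with probability e, 0.5 and 1 - e, respectively."""
    var_reads = np.asarray(var_reads)
    depths = np.asarray(depths)
    p = np.array([1 / 3, 1 / 3, 1 / 3])   # genotype frequencies
    e = 0.05                              # initial error-rate guess
    for _ in range(n_iter):
        q = np.array([e, 0.5, 1 - e])     # per-genotype variant-read prob
        # E-step: posterior genotype responsibilities per individual
        like = np.array([[math.comb(int(d), int(v)) * qg ** v
                          * (1 - qg) ** (d - v) for qg in q]
                         for v, d in zip(var_reads, depths)]) * p
        resp = like / like.sum(axis=1, keepdims=True)
        # M-step: error reads are variant reads under RR and
        # reference reads under AA
        p = resp.mean(axis=0)
        err = (resp[:, 0] * var_reads).sum() \
            + (resp[:, 2] * (depths - var_reads)).sum()
        tot = (resp[:, 0] * depths).sum() + (resp[:, 2] * depths).sum()
        e = max(err / tot, 1e-6)
    return p, e

# simulated unrelated individuals at one site
rng = np.random.default_rng(6)
genos = rng.choice(3, size=300, p=[0.7, 0.2, 0.1])
depths = np.full(300, 20)
var_reads = rng.binomial(depths, np.array([0.01, 0.5, 0.99])[genos])
p_hat, e_hat = seqem(var_reads, depths)
```

The contrast with fixed-filter callers is visible here: the error rate is learned from the sample itself rather than specified a priori.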
Over recent years small submicroscopic DNA copy-number variants (CNVs) have been highlighted as an important source of variation in the human genome, human phenotypic diversity and disease susceptibility. Consequently, there is a pressing need for the development of methods that allow the efficient, accurate and cheap measurement of genomic copy number polymorphisms in clinical cohorts. We have developed a simple competitive PCR based method to determine DNA copy number which uses the entire genome of a single chimpanzee as a competitor thus eliminating the requirement for competitive sequences to be synthesized for each assay. This results in the requirement for only a single reference sample for all assays and dramatically increases the potential for large numbers of loci to be analysed in multiplex. In this study we establish proof of concept by accurately detecting previously characterized mutations at the PARK2 locus and then demonstrating the potential of quantitative interspecies competitive PCR (qicPCR) to accurately genotype CNVs in association studies by analysing chromosome 22q11 deletions in a sample of previously characterized patients and normal controls.
Motivation: DNA copy number aberration (CNA) is a hallmark of genomic abnormality in tumor cells. Recurrent CNA (RCNA) occurs in multiple cancer samples across the same chromosomal region and has greater implication in tumorigenesis. Current commonly used methods for RCNA identification require CNA calling for individual samples before cross-sample analysis. This two-step strategy may result in a heavy computational burden, as well as a loss of overall statistical power due to segmentation and discretization of individual samples' data. We propose a population-based approach for RCNA detection with no need of single-sample analysis, which is statistically powerful, computationally efficient and particularly suitable for high-resolution and large-population studies.
Results: Our approach, correlation matrix diagonal segmentation (CMDS), identifies RCNAs based on a between-chromosomal-site correlation analysis. Directly using the raw intensity ratio data from all samples and adopting a diagonal transformation strategy, CMDS substantially reduces the computational burden and obtains results very quickly from large datasets. Our simulation indicates that the statistical power of CMDS is higher than that of two-step approaches based on single-sample CNA calling. We applied CMDS to two real datasets of lung cancer and brain cancer from Affymetrix and Illumina array platforms, respectively, and successfully identified known regions of CNA associated with EGFR, KRAS and other important oncogenes. CMDS provides a fast, powerful and easily implemented tool for the RCNA analysis of large-scale data from cancer genomes.
Availability: The R and C programs implementing our method are available at https://dsgweb.wustl.edu/qunyuan/software/cmds.
Supplementary information: Supplementary data are available at Bioinformatics online.
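The between-site correlation idea can be sketched as a local correlation scan; the window size and the scoring below are illustrative simplifications, not CMDS's diagonal transformation.

```python
import numpy as np

def local_corr_score(X, w=5):
    """Sketch of a between-site correlation scan: for every probe,
    average its across-sample correlation with the w nearest probes on
    each side; recurrent CNA regions show elevated local correlation."""
    Xc = X - X.mean(axis=0)
    Xc = Xc / (Xc.std(axis=0) + 1e-12)
    n_probes = X.shape[1]
    score = np.zeros(n_probes)
    for j in range(n_probes):
        lo, hi = max(0, j - w), min(n_probes, j + w + 1)
        neigh = [k for k in range(lo, hi) if k != j]
        score[j] = np.mean([(Xc[:, j] * Xc[:, k]).mean() for k in neigh])
    return score

# 50 samples x 200 probes; ~40% of samples share a gain over probes 90-109
rng = np.random.default_rng(7)
X = rng.normal(0.0, 1.0, (50, 200))
carriers = rng.random(50) < 0.4
X[carriers, 90:110] += 2.0
score = local_corr_score(X)
```

No per-sample segmentation or calling happens here: recurrence is read directly off the raw intensity matrix, which is the source of both the speed and the power advantage the abstract claims.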
Variations in DNA copy number carry information on the modalities of genome evolution and mis-regulation of DNA replication in cancer cells. Their study can help localize tumor suppressor genes, distinguish different populations of cancerous cells, and identify genomic variations responsible for disease phenotypes. A number of different high throughput technologies can be used to identify copy number variable sites, and the literature documents multiple effective algorithms. We focus here on the specific problem of detecting regions where variation in copy number is relatively common in the sample at hand. This problem encompasses the cases of copy number polymorphisms, related samples, technical replicates, and cancerous sub-populations from the same individual.
We present a segmentation method named generalized fused lasso (GFL) to reconstruct copy number variant regions. GFL is based on penalized estimation and is capable of processing multiple signals jointly. Our approach is computationally very attractive and leads to sensitivity and specificity levels comparable to those of state-of-the-art specialized methodologies. We illustrate its applicability with simulated and real data sets.
The flexibility of our framework makes it applicable to data obtained with a wide range of technologies. Its versatility and speed make GFL particularly useful in the initial screening stages of large data sets.
Copy number variant; Copy number polymorphism; Fused lasso; Group fused lasso; MM algorithm
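A minimal MM sketch of the 1-D (single-signal) fused lasso illustrates the penalized estimation; the dense linear solve is for clarity only, and a production implementation would exploit the tridiagonal structure of the system.

```python
import numpy as np

def fused_lasso_mm(y, lam, n_iter=100, eps=1e-6):
    """MM sketch for the 1-D fused lasso: minimize
    0.5 * ||y - b||^2 + lam * sum_i |b[i+1] - b[i]|.
    Each absolute difference is majorized by a quadratic, so every
    iteration reduces to a linear solve."""
    n = len(y)
    D = np.diff(np.eye(n), axis=0)           # first-difference operator
    b = y.astype(float).copy()
    for _ in range(n_iter):
        w = 1.0 / (np.abs(D @ b) + eps)      # majorizer weights
        A = np.eye(n) + lam * D.T @ (w[:, None] * D)
        b = np.linalg.solve(A, y)
    return b

# noisy step signal: one true jump at position 30
rng = np.random.default_rng(8)
y = np.concatenate([np.zeros(30), np.full(30, 3.0)])
y += rng.normal(0.0, 0.3, 60)
b = fused_lasso_mm(y, lam=2.0)
```

As the weights grow on near-zero differences, those differences are driven to zero, producing the piecewise-constant reconstructions that represent copy number variant regions.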
Accurate estimation of DNA copy numbers from array comparative genomic hybridization (CGH) data is important for characterizing the cancer genome. An important part of this process is the segmentation of the log-ratios between the sample and control DNA along the chromosome into regions of different copy numbers. However, multiple algorithms are available in the literature for this procedure and the results can vary substantially among them. Thus, a visualization tool that can display the segmented profiles from a number of methods can help the biologist or the clinician ascertain that a feature of interest did not arise as an artifact of the algorithm. Such a tool also allows methodologists to easily contrast their methods against others.
We developed a web-based tool that applies a number of popular algorithms to a single array CGH profile entered by the user. It generates a heatmap panel of the segmented profiles for each method as well as a consensus profile. The clickable heatmap can be moved along the chromosome and zoomed in or out. It also displays the time that each algorithm took and provides numerical values of the segmented profiles for download. The web interface calls algorithms written in the statistical language R. We encourage developers of new algorithms to submit their routines to be incorporated into the website.
Motivation: With improved short-read assembly algorithms and the recent development of long-read sequencers, split mapping will soon be the preferred method for structural variant (SV) detection. Yet, current alignment tools are not well suited for this.
Results: We present YAHA, a fast and flexible hash-based aligner. YAHA is as fast and accurate as BWA-SW at finding the single best alignment per query and is dramatically faster and more sensitive than both SSAHA2 and MegaBLAST at finding all possible alignments. Unlike other aligners that report all, or one, alignment per query, or that use simple heuristics to select alignments, YAHA uses a directed acyclic graph to find the optimal set of alignments that cover a query using a biologically relevant breakpoint penalty. YAHA can also report multiple mappings per defined segment of the query. We show that YAHA detects more breakpoints in less time than BWA-SW across all SV classes, and especially excels at complex SVs comprising multiple breakpoints.
Availability: YAHA is currently supported on 64-bit Linux systems. Binaries and sample data are freely available for download from http://faculty.virginia.edu/irahall/YAHA.
Copy number variants (CNVs) are alterations of the DNA of a genome that result in a cell having fewer or more than two copies of segments of the DNA. CNVs correspond to relatively large regions of the genome, ranging from about one kilobase to several megabases, that are deleted or duplicated. Motivated by CNV analysis based on next-generation sequencing data, we consider the problem of detecting and identifying sparse short segments hidden in a long linear sequence of data with an unspecified noise distribution. We propose a computationally efficient method that provides a robust and near-optimal solution for segment identification over a wide range of noise distributions. We theoretically quantify the conditions for detecting the segment signals and show that the method near-optimally estimates the signal segments whenever it is possible to detect their existence. Simulation studies are carried out to demonstrate the efficiency of the method under different noise distributions. We present results from a CNV analysis of a HapMap Yoruban sample to further illustrate the theory and the methods.
Robust segment detector; Robust segment identifier; optimality; DNA copy number variant; next generation sequencing data
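A robust scan in this spirit can be sketched with a trimmed-mean statistic, which stays stable under heavy-tailed (even Cauchy) noise; the exhaustive scan and the trimming fraction are illustrative assumptions, not the paper's procedure.

```python
import numpy as np

def trimmed_mean(x, prop=0.2):
    """Mean after discarding a fraction prop of the smallest and largest
    values; stable under heavy-tailed noise."""
    x = np.sort(x)
    k = int(len(x) * prop)
    return x[k:len(x) - k].mean()

def robust_segment_scan(y, min_len=10, max_len=30):
    """Robust sketch: scan short intervals with a trimmed-mean statistic
    scaled by sqrt(length), avoiding any Gaussian noise assumption."""
    n = len(y)
    best, best_ij = -np.inf, (0, 0)
    for i in range(n):
        for j in range(i + min_len - 1, min(i + max_len, n)):
            stat = abs(trimmed_mean(y[i:j + 1])) * np.sqrt(j - i + 1)
            if stat > best:
                best, best_ij = stat, (i, j)
    return best_ij, best

# heavy-tailed (Cauchy) noise with one shifted segment
rng = np.random.default_rng(9)
y = rng.standard_cauchy(200)
y[100:115] += 5.0
(i, j), stat = robust_segment_scan(y)
```

An ordinary mean-based scan would be dominated by single Cauchy outliers; trimming removes them, which is the essence of working under an unspecified noise distribution.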
Motivation: The discovery of genomic structural variants (SVs) at high sensitivity and specificity is an essential requirement for characterizing naturally occurring variation and for understanding pathological somatic rearrangements in personal genome sequencing data. Of particular interest are integrated methods that accurately identify simple and complex rearrangements in heterogeneous sequencing datasets at single-nucleotide resolution, as an optimal basis for investigating the formation mechanisms and functional consequences of SVs.
Results: We have developed an SV discovery method, called DELLY, that integrates short insert paired-ends, long-range mate-pairs and split-read alignments to accurately delineate genomic rearrangements at single-nucleotide resolution. DELLY is suitable for detecting copy-number variable deletion and tandem duplication events as well as balanced rearrangements such as inversions or reciprocal translocations. DELLY thus enables ascertainment of the full spectrum of genomic rearrangements, including complex events. On simulated data, DELLY compares favorably to other SV prediction methods across a wide range of sequencing parameters. On real data, DELLY reliably uncovers SVs from the 1000 Genomes Project and cancer genomes, and validation experiments of randomly selected deletion loci show a high specificity.
Availability: DELLY is available at www.korbel.embl.de/software.html
Motivation: High-throughput techniques facilitate the simultaneous measurement of DNA copy number at hundreds of thousands of sites on a genome. Older techniques allow measurement only of total copy number, the sum of the copy number contributions from the two parental chromosomes. Newer single nucleotide polymorphism (SNP) techniques can, in addition, quantify parent-specific copy number (PSCN). The raw data from such experiments are two-dimensional, but are unphased. Consequently, inference based on them necessitates the development of new analytic methods.
Methods: We have adapted and enhanced the circular binary segmentation (CBS) algorithm for this purpose with focus on paired test and reference samples. The essence of paired parent-specific CBS (Paired PSCBS) is to utilize the original CBS algorithm to identify regions of equal total copy number and then to further segment these regions where there have been changes in PSCN. For the final set of regions, calls are made of equal parental copy number and loss of heterozygosity (LOH). PSCN estimates are computed both before and after calling.
Results: The methodology is evaluated by simulation and on glioblastoma data. In the simulation, PSCBS compares favorably to established methods. On the glioblastoma data, PSCBS identifies interesting genomic regions, such as copy-neutral LOH.
Availability: The Paired PSCBS method is implemented in an open-source R package named PSCBS, available on CRAN (http://cran.r-project.org/).
Supplementary information: Supplementary data are available at Bioinformatics online.
Recent studies suggest that copy number polymorphisms (CNPs) may play an important role in disease susceptibility and onset. Currently, the detection of CNPs mainly depends on microarray technology. For case-control studies, conventionally, subjects are assigned to a specific CNP category based on the continuous quantitative measure produced by microarray experiments, and cases and controls are then compared using a chi-square test of independence. The purpose of this work is to specify the likelihood ratio test statistic (LRTS) for case-control sampling design based on the underlying continuous quantitative measurement, and to assess its power and relative efficiency (as compared to the chi-square test of independence on CNP counts). The sample size and power formulas of both methods are given. For the latter, the CNPs are classified using the Bayesian classification rule. The LRTS is more powerful than this chi-square test for the alternatives considered, especially alternatives in which the at-risk CNP categories have low frequencies. An example of the application of the LRTS is given for a comparison of CNP distributions in individuals of Caucasian or Taiwanese ethnicity, where the LRTS appears to be more powerful than the chi-square test, possibly due to misclassification of the most common CNP category into a less common category.
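The Bayesian classification rule used to assign continuous measurements to CNP categories can be sketched under a Gaussian mixture assumption; the means, standard deviations and priors below are illustrative.

```python
import numpy as np

def bayes_classify(x, means, sds, priors):
    """Bayes classification rule: assign each continuous CNP measurement
    to the category maximizing prior times normal density."""
    x = np.asarray(x, dtype=float)[:, None]
    means = np.asarray(means, dtype=float)
    sds = np.asarray(sds, dtype=float)
    log_post = (np.log(priors) - np.log(sds)
                - 0.5 * ((x - means) / sds) ** 2)
    return np.argmax(log_post, axis=1)

# three CNP categories with unequal priors
cats = bayes_classify([0.05, 1.10, 1.95],
                      means=[0.0, 1.0, 2.0],
                      sds=[0.2, 0.2, 0.2],
                      priors=[0.6, 0.3, 0.1])
```

For a measurement falling between two category means, the prior term pulls the call toward the more common category, which is exactly the misclassification mechanism the abstract discusses as a source of power loss for the count-based chi-square test.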