Related Articles

1.  Joint estimation of DNA copy number from multiple platforms 
Bioinformatics  2009;26(2):153-160.
Motivation: DNA copy number variants (CNVs) are gains and losses of segments of chromosomes, and comprise an important class of genetic variation. Recently, various microarray hybridization-based techniques have been developed for high-throughput measurement of DNA copy number. In many studies, multiple technical platforms or different versions of the same platform were used to interrogate the same samples; and it became necessary to pool information across these multiple sources to derive a consensus molecular profile for each sample. An integrated analysis is expected to maximize resolution and accuracy, yet currently there is no well-formulated statistical method to address the between-platform differences in probe coverage, assay methods, sensitivity and analytical complexity.
Results: The conventional approach is to apply one of the CNV detection (‘segmentation’) algorithms to search for DNA segments of altered signal intensity. The results from multiple platforms are combined after segmentation. Here we propose a new method, Multi-Platform Circular Binary Segmentation (MPCBS), which pools statistical evidence across platforms during segmentation, and does not require pre-standardization of different data sources. It involves a weighted sum of t-statistics, which arises naturally from the generalized log-likelihood ratio of a multi-platform model. By comparing the integrated analysis of Affymetrix and Illumina SNP array data with Agilent and fosmid clone end-sequencing results on eight HapMap samples, we show that MPCBS achieves improved spatial resolution and detection power and provides a natural consensus across platforms. We also apply the new method to analyze multi-platform data for tumor samples.
Availability: The R package for MPCBS is registered on R-Forge (http://r-forge.r-project.org/) under project name MPCBS.
Contact: nzhang@stanford.edu; junzli@umich.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btp653
PMCID: PMC2852203  PMID: 19933593
2.  Joint estimation of copy number variation and reference intensities on multiple DNA arrays using GADA 
Bioinformatics  2009;25(10):1223-1230.
Motivation: The complexity of a large number of recently discovered copy number polymorphisms is much higher than initially thought, thus making it more difficult to detect them in the presence of significant measurement noise. In this scenario, separate normalization and segmentation is prone to lead to many false detections of changes in copy number. New approaches capable of jointly modeling the copy number and the non-copy number (noise) hybridization effects across multiple samples will potentially lead to more accurate results.
Methods: In this article, the genome alteration detection analysis (GADA) approach introduced in our previous work is extended to a multiple sample model. The copy number component is independent for each sample and uses a sparse Bayesian prior, while the reference hybridization level is not necessarily sparse but identical on all samples. The expectation maximization (EM) algorithm used to fit the model iteratively determines whether the observed hybridization levels are more likely due to a copy number variation or to a shared hybridization bias.
Results: The newly proposed approach is compared with the currently used strategy of separate normalization followed by independent segmentation of each array. Real microarray data obtained from HapMap samples are randomly partitioned to create different reference sets. Using the new approach, copy number and reference intensity estimates are significantly less variable if the reference set changes, and a higher consistency of copy numbers detected within HapMap family trios is obtained. Finally, the running time to fit the model grows linearly in the number of samples and probes.
Availability: http://biron.usc.edu/~piquereg/GADA
Contact: rpique@ieee.org; shahab@chla.usc.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btp119
PMCID: PMC2732310  PMID: 19276152
3.  Systematic Inference of Copy-Number Genotypes from Personal Genome Sequencing Data Reveals Extensive Olfactory Receptor Gene Content Diversity 
PLoS Computational Biology  2010;6(11):e1000988.
Copy-number variations (CNVs) are widespread in the human genome, but comprehensive assignments of integer locus copy-numbers (i.e., copy-number genotypes) that, for example, enable discrimination of homozygous from heterozygous CNVs, have remained challenging. Here we present CopySeq, a novel computational approach with an underlying statistical framework that analyzes the depth-of-coverage of high-throughput DNA sequencing reads, and can incorporate paired-end and breakpoint junction analysis based CNV-analysis approaches, to infer locus copy-number genotypes. We benchmarked CopySeq by genotyping 500 chromosome 1 CNV regions in 150 personal genomes sequenced at low-coverage. The assessed copy-number genotypes were highly concordant with our performed qPCR experiments (Pearson correlation coefficient 0.94), and with the published results of two microarray platforms (95–99% concordance). We further demonstrated the utility of CopySeq for analyzing gene regions enriched for segmental duplications by comprehensively inferring copy-number genotypes in the CNV-enriched >800 olfactory receptor (OR) human gene and pseudogene loci. CopySeq revealed that OR loci display an extensive range of locus copy-numbers across individuals, with zero to two copies in some OR loci, and two to nine copies in others. Among genetic variants affecting OR loci we identified deleterious variants including CNVs and SNPs affecting ∼15% and ∼20% of the human OR gene repertoire, respectively, implying that genetic variants with a possible impact on smell perception are widespread. Finally, we found that for several OR loci the reference genome appears to represent a minor-frequency variant, implying a necessary revision of the OR repertoire for future functional studies. 
CopySeq can ascertain genomic structural variation in specific gene families as well as at a genome-wide scale, where it may enable the quantitative evaluation of CNVs in genome-wide association studies involving high-throughput sequencing.
Author Summary
Human individual genome sequencing has recently become affordable, enabling highly detailed genetic sequence comparisons. While the identification and genotyping of single-nucleotide polymorphisms has already been successfully established for different sequencing platforms, the detection, quantification and genotyping of large-scale copy-number variants (CNVs), i.e., losses or gains of long genomic segments, has remained challenging. We present a computational approach that enables detecting CNVs in sequencing data and accurately identifies the actual copy-number at which DNA segments of interest occur in an individual genome. This approach enabled us to obtain novel insights into the largest human gene family – the olfactory receptors (ORs) – involved in smell perception. While previous studies reported an abundance of CNVs in ORs, our approach enabled us to globally identify absolute differences in OR gene counts that exist between humans. While several OR genes have very high gene counts, other ORs are found only once or are missing entirely in some individuals. The latter have a particularly high probability of influencing individual differences in the perception of smell, a question that future experimental efforts can now address. Furthermore, we observed differences in OR gene counts between populations, pointing at ORs that might contribute to population-specific differences in smell.
doi:10.1371/journal.pcbi.1000988
PMCID: PMC2978733  PMID: 21085617
4.  Cheek swabs, SNP chips, and CNVs: Assessing the quality of copy number variant calls generated with subject-collected mail-in buccal brush DNA samples on a high-density genotyping microarray 
BMC Medical Genetics  2012;13:51.
Background
Multiple investigators have established the feasibility of using buccal brush samples to genotype single nucleotide polymorphisms (SNPs) with high-density genome-wide microarrays, but there is currently no consensus on the accuracy of copy number variants (CNVs) inferred from these data. Regardless of the source of DNA, it is more difficult to detect CNVs than to genotype SNPs using these microarrays, and it therefore remains an open question whether buccal brush samples provide enough high-quality DNA for this purpose.
Methods
To demonstrate the quality of CNV calls generated from DNA extracted from buccal samples, compared to calls generated from blood samples, we evaluated the concordance of calls from individuals who provided both sample types. The Illumina Human660W-Quad BeadChip was used to determine SNPs and CNVs of 39 Arkansas participants in the National Birth Defects Prevention Study (NBDPS), including 16 mother-infant dyads, who provided both whole blood and buccal brush DNA samples.
Results
We observed a 99.9% concordance rate of SNP calls in the 39 blood–buccal pairs. From the same dataset, we performed a similar analysis of CNVs. Each of the 78 samples was independently segmented into regions of like copy number using the Optimal Segmentation algorithm of Golden Helix SNP & Variation Suite 7.
Across 640,663 loci on 22 autosomal chromosomes, segment-mean log R ratios had an average correlation of 0.899 between blood-buccal pairs of samples from the same individual, while the average correlation between all possible blood-buccal pairs of samples from unrelated individuals was 0.318. An independent analysis using the QuantiSNP algorithm produced average correlations of 0.943 between blood-buccal pairs from the same individual versus 0.332 between samples from unrelated individuals.
Segment-mean log R ratios had an average correlation of 0.539 between mother-offspring dyads of buccal samples, which was not statistically significantly different from the average correlation of 0.526 between mother-offspring dyads of blood samples (p=0.302).
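The concordance measure reported above is a correlation between two per-locus vectors of segment-mean log R ratios. The following is a minimal sketch of that calculation, assuming the Pearson coefficient and two equally long lists of values at the same loci; the function name `pearson` and the example numbers are illustrative and do not come from the study.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long sequences of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares around the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(ss_x * ss_y)

# Hypothetical segment means at the same five loci for one individual.
blood = [0.02, 0.01, -0.45, -0.44, 0.03]
buccal = [0.05, -0.01, -0.40, -0.47, 0.00]
r = pearson(blood, buccal)
```

A shared deletion segment, as at the third and fourth loci here, pulls both vectors down together and drives the coefficient toward 1.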
Conclusions
We observed performance from the subject-collected mail-in buccal brush samples comparable to that of blood. These results show that such DNA samples can be used for genome-wide scans of both SNPs and CNVs, and that high rates of CNV concordance were achieved whether using a change-point-based algorithm or one based on a hidden Markov model (HMM).
doi:10.1186/1471-2350-13-51
PMCID: PMC3506514  PMID: 22734463
SNPs, Single nucleotide polymorphisms; CNVs, Copy number variants; NBDPS, National Birth Defects Prevention Study; Buccal brush
5.  Segmentation and Estimation for SNP Microarrays: a Bayesian Multiple Change Point Approach 
Biometrics  2010;66(3):675-683.
Summary
High-density SNP microarrays provide a useful tool for the detection of copy number variants (CNVs). The analysis of such large amounts of data is complicated, especially with regard to determining where copy numbers change and their corresponding values. In this paper, we propose a Bayesian multiple change point model (BMCP) for segmentation and estimation of SNP microarray data. Segmentation concerns separating a chromosome into regions of equal copy number differences between the sample of interest and some reference, and involves the detection of locations of copy number difference changes. Estimation concerns determining true copy number for each segment. Our approach not only gives posterior estimates for the parameters of interest, namely locations for copy number difference changes and true copy number estimates, but also useful confidence measures. In addition, our algorithm can segment multiple samples simultaneously, and infer both common and rare CNVs across individuals. Finally, for studies of CNVs in tumors, we incorporate an adjustment factor for signal attenuation due to tumor heterogeneity or normal contamination that can improve copy number estimates.
doi:10.1111/j.1541-0420.2009.01328.x
PMCID: PMC3766751  PMID: 19764955
Bayesian multiple change points; copy number variant; estimation; segmentation; signal attenuation; SNP microarrays
6.  The multiple-specificity landscape of modular peptide recognition domains 
Using large scale experimental datasets, the authors show how modular protein interaction domains such as PDZ, SH3 or WW domains, frequently display unexpected multiple binding specificity. The observed multiple specificity leads to new structural insights and accurately predicts new protein interactions.
Modular protein domains interacting with short linear peptides, such as PDZ, SH3 or WW domains, display a rich binding specificity with significant interplay (or correlation) between ligand residues. The binding specificity of these domains is more accurately described with a multiple specificity model. The multiple specificity reveals new structural insights and predicts new protein interactions.
Modular protein domains have a central role in the complex network of signaling pathways that governs cellular processes. Many of them, called peptide recognition domains, bind short linear regions in their target proteins, such as the well-known SH3 or PDZ domains. These domain–peptide interactions are the predominant form of protein interaction in signaling pathways.
Because of the relative simplicity of the interaction, their binding specificity is generally represented using a simple model, analogous to transcription factor binding: the domain binds a short stretch of amino acids and at each position some amino acids are preferred over other ones. Thus, for each position, a probability can be assigned to each amino acid and these probabilities are often grouped into a matrix called position weight matrix (PWM) or position-specific scoring matrix. Such a matrix can then be represented in a highly intuitive manner as a so-called sequence logo (see Figure 1).
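As an illustration of the position weight matrix described above, here is a minimal sketch that estimates per-position amino acid probabilities from aligned peptides of equal length. The function names, the pseudocount parameter, and the example peptides are assumptions made for illustration; they are not taken from the article.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def position_weight_matrix(peptides, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Return one dict per position mapping amino acid -> probability."""
    length = len(peptides[0])
    if any(len(p) != length for p in peptides):
        raise ValueError("all peptides must have the same length")
    total = len(peptides) + pseudocount * len(alphabet)
    pwm = []
    for pos in range(length):
        # Count how often each amino acid occurs at this position.
        counts = Counter(p[pos] for p in peptides)
        pwm.append({aa: (counts.get(aa, 0) + pseudocount) / total
                    for aa in alphabet})
    return pwm

def pwm_score(pwm, peptide):
    """Probability of a peptide under the PWM: a product over positions."""
    prob = 1.0
    for column, aa in zip(pwm, peptide):
        prob *= column[aa]
    return prob

# Hypothetical C-terminal peptides.
pwm = position_weight_matrix(["ETSV", "ETDV", "KTSV"], pseudocount=0.0)
```

Because `pwm_score` multiplies one term per position, it treats every residue as contributing independently to binding.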
A main shortcoming of this specificity model is that, although intuitive and interpretable, it inherently assumes that all residues in the peptide contribute independently to binding. On the basis of statistical analyses of large data sets of peptides binding to PDZ, SH3 and WW domains, we show that for most domains, this is not the case. Indeed, there is complex and highly significant interplay between the ligand residues. To overcome this issue, we develop a computational model that can both take into account such correlations and also preserve the main advantage of PWMs, namely their straightforward interpretability.
Briefly, our method detects whether the domain is capable of binding its targets not only with a single specificity but also with multiple specificities. If so, it will determine all the relevant specificities (see Figure 1). This is accomplished by using a machine learning algorithm based on mixture models, and the results can be effectively visualized as multiple sequence logos. In other words, based on experimentally derived data sets of binding peptides, we determine for every domain, in addition to the known specificity, one or more new specificities. As such, we capture more real information, and our model performs better than previous models of binding specificity.
A crucial question is what these new specificities correspond to: are they simply mathematical artifacts coming out of some algorithm or do they represent something we can understand on a biophysical or structural level? Overall, the new specificities provide us with substantial new intuitive insight about the structural basis of binding for these domains. We can roughly identify two cases.
First, we have neighboring (or very close in sequence) amino acids in the ligand that show significant correlations. These usually correspond to amino acids whose side chains point in the same directions and often occupy the same physical space, and therefore can directly influence each other.
In other cases, we observe that multiple specificities found for a single domain are very different from each other. They correspond to different ways that the domain accommodates its binders. Often, conformational changes are required to switch from one binding mode to another. In almost all cases, only one canonical binding mode was previously known, and our analysis enables us to predict several interesting non-canonical ones. Specifically, we discuss one example in detail in Figure 5. In a PDZ domain of DLG1, we identify a novel binding specificity that differs from the canonical one by the presence of an additional tryptophan at the C terminus of the ligand. From a structural point of view, this would require a flexible loop to move out of the way to accommodate this rather large side chain. We find evidence of this predicted new binding mode based on both existing crystal structures and structural modeling.
Finally, our model of binding specificity leads to predictions of many new and previously unknown protein interactions. We validate a number of these using the membrane yeast two-hybrid approach.
In summary, we show here that multiple specificity is a general and underappreciated phenomenon for modular peptide recognition domains and that it leads to substantial new insight into the basis of protein interactions.
Modular protein interaction domains form the building blocks of eukaryotic signaling pathways. Many of them, known as peptide recognition domains, mediate protein interactions by recognizing short, linear amino acid stretches on the surface of their cognate partners with high specificity. Residues in these stretches are usually assumed to contribute independently to binding, which has led to a simplified understanding of protein interactions. Conversely, we observe in large binding peptide data sets that different residue positions display highly significant correlations for many domains in three distinct families (PDZ, SH3 and WW). These correlation patterns reveal a widespread occurrence of multiple binding specificities and give novel structural insights into protein interactions. For example, we predict a new binding mode of PDZ domains and structurally rationalize it for DLG1 PDZ1. We show that multiple specificity more accurately predicts protein interactions and experimentally validate some of the predictions for the human proteins DLG1 and SCRIB. Overall, our results reveal a rich specificity landscape in peptide recognition domains, suggesting new ways of encoding specificity in protein interaction networks.
doi:10.1038/msb.2011.18
PMCID: PMC3097085  PMID: 21525870
binding specificity; peptide recognition domains; PDZ; phage display; residue correlations
7.  Assessing the Significance of Conserved Genomic Aberrations Using High Resolution Genomic Microarrays 
PLoS Genetics  2007;3(8):e143.
Genomic aberrations recurrent in a particular cancer type can be important prognostic markers for tumor progression. Typically in early tumorigenesis, cells incur a breakdown of the DNA replication machinery that results in an accumulation of genomic aberrations in the form of duplications, deletions, translocations, and other genomic alterations. Microarray methods allow for finer mapping of these aberrations than has previously been possible; however, data processing and analysis methods have not taken full advantage of this higher resolution. Attention has primarily been given to analysis on the single sample level, where multiple adjacent probes are necessarily used as replicates for the local region containing their target sequences. However, regions of concordant aberration can be short enough to be detected by only one, or very few, array elements. We describe a method called Multiple Sample Analysis for assessing the significance of concordant genomic aberrations across multiple experiments that does not require a-priori definition of aberration calls for each sample. If there are multiple samples, representing a class, then by exploiting the replication across samples our method can detect concordant aberrations at much higher resolution than can be derived from current single sample approaches. Additionally, this method provides a meaningful approach to addressing population-based questions such as determining important regions for a cancer subtype of interest or determining regions of copy number variation in a population. Multiple Sample Analysis also provides single sample aberration calls in the locations of significant concordance, producing high resolution calls per sample, in concordant regions. 
The approach is demonstrated on a dataset representing a challenging but important resource: breast tumors that have been formalin-fixed, paraffin-embedded, archived, and subsequently UV-laser capture microdissected and hybridized to two-channel BAC arrays using an amplification protocol. We demonstrate the accurate detection on simulated data, and on real datasets involving known regions of aberration within subtypes of breast cancer at a resolution consistent with that of the array. Similarly, we apply our method to previously published datasets, including a 250K SNP array, and verify known results as well as detect novel regions of concordant aberration. The algorithm has been fully implemented and tested and is freely available as a Java application at http://www.cbil.upenn.edu/MSA.
Author Summary
Cancer is a genetic disease caused by genomic mutations that confer an increased ability to proliferate and survive in a specific environment. It is now known that many regions of genomic DNA are deleted or amplified in specific cancer types. These aberrations are believed to occur randomly in the genome. If these aberrations overlap more than would be expected by chance across individual occurrences of the cancer this suggests a selective pressure on this aberration. These conserved aberrations likely represent regions that are important for the development, progression, and survival of a specific cancer type in its environment. We present a method for identifying these conserved aberrations within a class of samples. The applications for this method include accurate high resolution mapping of aberrations characteristic of cancer subtypes as well as other genetic diseases and determination of conserved copy number variations in the population. With the use of high resolution microarray methods we have profiled different tumor types. We have been able to create high resolution profiles of conserved aberrations in specific cancer types. These conserved aberrations are prime targets for cancer therapies and many of these regions have already been used to develop effective cancer therapeutics.
doi:10.1371/journal.pgen.0030143
PMCID: PMC1950957  PMID: 17722985
8.  Markov Models for Inferring Copy Number Variations from Genotype Data on Illumina Platforms 
Human Heredity  2009;68(1):1-22.
Background/Aims
Illumina genotyping arrays provide information on DNA copy number. Current methodology for their analysis assumes linkage equilibrium across adjacent markers. This is unrealistic, given the markers' high density, and can result in reduced specificity. Another limitation of current methods is that they cannot be directly applied to the analysis of multiple samples with the goal of detecting copy number polymorphisms and their association with traits of interest.
Methods
We propose a new Hidden Markov Model for Illumina genotype data that takes into account linkage disequilibrium between adjacent loci. Our framework also allows for location-specific deletion/duplication rates. When multiple samples are available, we describe a methodology for their analysis that simultaneously reconstructs the copy number states in each sample and identifies genomic locations with increased variability in copy number in the population. This approach can be extended to test association between copy number variants and a disease trait.
Results and Conclusions
We show that taking into account linkage disequilibrium between adjacent markers can increase the specificity of an HMM in reconstructing copy number variants, especially single copy deletions. Our multisample approach is computationally practical and can increase the power of association studies.
doi:10.1159/000210445
PMCID: PMC2880724  PMID: 19339782
Linkage; Disequilibrium association
9.  Learning histopathological patterns 
Aims:
The aim was to demonstrate a method for automated image analysis of immunohistochemically stained tissue samples for extracting features that correlate with patient disease. We address the problem of quantifying tumor tissue and segmenting and counting cell nuclei.
Materials and Methods:
Our method utilizes a flexible segmentation method based on sparse coding trained from representative image samples. Nuclei counting is based on a nucleus model that takes size, shape, and nucleus probability into account. Nuclei clustering and overlays are resolved using a gray-weighted distance transform. We obtain a probability measure for pixels belonging to a nucleus from our segmentation procedure. Experiments are carried out on two sets of immunohistochemically stained images – one set based on the estrogen receptor (ER) and the other on antigen KI-67. For the nuclei separation we have selected 207 ER image samples from 58 tissue micro array-cores corresponding to 58 patients and 136 KI-67 image samples also from 58 cores. The images are hand-annotated by marking the center position of each nucleus. For the ER data we have a total of 1006 nuclei and for the KI-67 we have 796 nuclei. Segmentation performance was evaluated in terms of missing nuclei, falsely detected nuclei, and multiple detections. The proposed method is compared to state-of-the-art Bayesian classification.
Statistical analysis used:
The performance of the proposed method and a state-of-the-art algorithm including variations thereof is compared using the Wilcoxon rank sum test.
Results:
For both the ER experiment and the KI-67 experiment the proposed method exhibits lower error rates than the state-of-the-art method. Total error rates were 4.8 % and 7.7 % in the two experiments, corresponding to an average of 0.23 and 0.45 errors per image, respectively. The Wilcoxon rank sum tests show statistically significant improvements over the state-of-the-art method.
Conclusions:
We have demonstrated a method and obtained good performance compared to state-of-the-art nuclei separation. The segmentation procedure is simple, highly flexible, and we demonstrate how it, in addition to the nuclei separation, can perform precise segmentation of cancerous tissue. The complexity of the segmentation procedure is linear in the image size and the nuclei separation is linear in the number of nuclei. Additionally the method can be parallelized to obtain high-speed computations.
doi:10.4103/2153-3539.92033
PMCID: PMC3312718  PMID: 22811956
Computer-aided classification; digital histopathology images; flexible learning based segmentation; image segmentation
10.  Ranked retrieval of segmented nuclei for objective assessment of cancer gene repositioning 
BMC Bioinformatics  2012;13:232.
Background
Correct segmentation is critical to many applications within automated microscopy image analysis. Despite the availability of advanced segmentation algorithms, variations in cell morphology, sample preparation, and acquisition settings often lead to segmentation errors. This manuscript introduces a ranked-retrieval approach using logistic regression to automate selection of accurately segmented nuclei from a set of candidate segmentations. The methodology is validated on an application of spatial gene repositioning in breast cancer cell nuclei. Gene repositioning is analyzed in patient tissue sections by labeling sequences with fluorescence in situ hybridization (FISH), followed by measurement of the relative position of each gene from the nuclear center to the nuclear periphery. This technique requires hundreds of well-segmented nuclei per sample to achieve statistical significance. Although the tissue samples in this study contain a surplus of available nuclei, automatic identification of the well-segmented subset remains a challenging task.
Results
Logistic regression was applied to features extracted from candidate segmented nuclei, including nuclear shape, texture, context, and gene copy number, in order to rank objects according to the likelihood of being an accurately segmented nucleus. The method was demonstrated on a tissue microarray dataset of 43 breast cancer patients, comprising approximately 40,000 imaged nuclei in which the HES5 and FRA2 genes were labeled with FISH probes. Three trained reviewers independently classified nuclei into three classes of segmentation accuracy. In man vs. machine studies, the automated method outperformed the inter-observer agreement between reviewers, as measured by area under the receiver operating characteristic (ROC) curve. Robustness of gene position measurements to boundary inaccuracies was demonstrated by comparing 1086 manually and automatically segmented nuclei. Pearson correlation coefficients between the gene position measurements were above 0.9 (p < 0.05). A preliminary experiment was conducted to validate the ranked retrieval in a test to detect cancer. Independent manual measurement of gene positions agreed with automatic results in 21 out of 26 statistical comparisons against a pooled normal (benign) gene position distribution.
Conclusions
Accurate segmentation is necessary to automate quantitative image analysis for applications such as gene repositioning. However, due to heterogeneity within images and across different applications, no segmentation algorithm provides a satisfactory solution. Automated assessment of segmentations by ranked retrieval is capable of reducing or even eliminating the need to select segmented objects by hand and represents a significant improvement over binary classification. The method can be extended to other high-throughput applications requiring accurate detection of cells or nuclei across a range of biomedical applications.
doi:10.1186/1471-2105-13-232
PMCID: PMC3484015  PMID: 22971117
11.  Copy number variation analysis based on AluScan sequences 
Background
AluScan combines inter-Alu PCR using multiple Alu-based primers with opposite orientations and next-generation sequencing to capture a huge number of Alu-proximal genomic sequences for investigation. Its requirement of only sub-microgram quantities of DNA facilitates the examination of large numbers of samples. However, the special features of AluScan data rendered difficult the calling of copy number variation (CNV) directly using the calling algorithms designed for whole genome sequencing (WGS) or exome sequencing.
Results
In this study, an AluScanCNV package has been assembled for efficient CNV calling from AluScan sequencing data employing a Geary-Hinkley transformation (GHT) of read-depth ratios between either paired test-control samples, or between test samples and a reference template constructed from reference samples, to call the localized CNVs, followed by use of a GISTIC-like algorithm to identify recurrent CNVs and circular binary segmentation (CBS) to reveal large extended CNVs. To evaluate the utility of CNVs called from AluScan data, the AluScans from 23 non-cancer and 38 cancer genomes were analyzed in this study. The glioma samples analyzed yielded the familiar extended copy-number losses on chromosomes 1p and 9. Also, the recurrent somatic CNVs identified from liver cancer samples were similar to those reported for liver cancer WGS with respect to a striking enrichment of copy-number gains in chromosomes 1q and 8q. When localized or recurrent CNV-features capable of distinguishing between liver and non-liver cancer samples were selected by correlation-based machine learning, a highly accurate separation of the liver and non-liver cancer classes was attained.
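The Geary-Hinkley transformation mentioned above maps a ratio of two approximately normal quantities to an approximately standard normal score. The sketch below applies the textbook form of the transformation to a single read-depth ratio. It is not the AluScanCNV implementation: the function name, the parameter names, and the default of zero correlation between test and control depths are assumptions made for illustration.

```python
import math

def geary_hinkley(ratio, mu_x, sigma_x, mu_y, sigma_y, rho=0.0):
    """Approximate z-score for ratio = X / Y.

    X (test depth) has mean mu_x and standard deviation sigma_x,
    Y (control depth) has mean mu_y and standard deviation sigma_y,
    and rho is their assumed correlation. The approximation is only
    reasonable when Y is very unlikely to be near zero or negative.
    """
    variance = (sigma_y ** 2) * ratio ** 2 \
        - 2.0 * rho * sigma_x * sigma_y * ratio \
        + sigma_x ** 2
    if variance <= 0:
        raise ValueError("non-positive variance term; check the inputs")
    return (mu_y * ratio - mu_x) / math.sqrt(variance)

# Hypothetical window: both depths average 100 reads with SD 10,
# and the observed test/control ratio is 1.5.
z = geary_hinkley(1.5, mu_x=100.0, sigma_x=10.0, mu_y=100.0, sigma_y=10.0)
```

A ratio equal to `mu_x / mu_y` gives a score of zero, and a large positive or negative score flags the window as a candidate gain or loss.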
Conclusions
The results obtained from non-cancer and cancerous tissues indicated that the AluScanCNV package can be employed to call localized, recurrent and extended CNVs from AluScan sequences. Moreover, both the localized and recurrent CNVs identified by this method could be subjected to machine-learning selection to yield distinguishing CNV-features that were capable of separating between liver cancers and other types of cancers. Since the method is applicable to any human DNA sample with or without the availability of a paired control, it can also be employed to analyze the constitutional CNVs of individuals.
Electronic supplementary material
The online version of this article (doi:10.1186/s13336-014-0015-z) contains supplementary material, which is available to authorized users.
doi:10.1186/s13336-014-0015-z
PMCID: PMC4273479  PMID: 25558350
AluScan sequencing; CNV calling; Cancer classification; Machine learning
12.  Simple binary segmentation frameworks for identifying variation in DNA copy number 
BMC Bioinformatics  2012;13:277.
Background
Variation in DNA copy number, due to gains and losses of chromosome segments, is common. A first step for analyzing DNA copy number data is to identify amplified or deleted regions in individuals. To locate such regions, we propose a circular binary segmentation procedure, which is based on a sequence of nested hypothesis tests, each using the Bayesian information criterion.
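The idea of nested tests scored by the Bayesian information criterion can be sketched as follows. This is a simplified illustration, not the authors' procedure: it uses plain (non-circular) binary segmentation for a shift in mean, a Gaussian BIC up to an additive constant, and an assumed parameter count of 1 for an unsplit segment and 3 for a split one (two means plus the split position). All function names are hypothetical.

```python
import math

def rss(xs):
    """Residual sum of squares around the segment mean."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs)

def bic(rss_value, n, n_params):
    """Gaussian BIC up to a constant; the floor avoids log(0)."""
    return n * math.log(max(rss_value, 1e-12) / n) + n_params * math.log(n)

def best_split(xs, min_len):
    """Return (position, total RSS) of the best single split."""
    best = None
    for k in range(min_len, len(xs) - min_len + 1):
        total = rss(xs[:k]) + rss(xs[k:])
        if best is None or total < best[1]:
            best = (k, total)
    return best

def segment(xs, offset=0, min_len=2):
    """Recursively split while the split model has the lower BIC."""
    n = len(xs)
    if n < 2 * min_len:
        return []
    k, split_rss = best_split(xs, min_len)
    if bic(split_rss, n, 3) >= bic(rss(xs), n, 1):
        return []  # the unsplit model is preferred
    left = segment(xs[:k], offset, min_len)
    right = segment(xs[k:], offset + k, min_len)
    return left + [offset + k] + right

# Synthetic log-ratio profile: a gained region of 20 probes, small fixed noise.
noise = [0.05 * ((7 * i) % 5 - 2) for i in range(50)]
levels = [0.0] * 20 + [1.0] * 20 + [0.0] * 10
profile = [lv + e for lv, e in zip(levels, noise)]
print(segment(profile))
```

Plain binary segmentation can miss a short aberration in the middle of a long sequence, because no single split separates it well. That weakness is the usual motivation for the circular variant named in the abstract.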
Results
Our procedure is convenient for analyzing DNA copy number in two general situations: (1) when using data from multiple sources and (2) when using cohort analysis of multiple patients suffering from the same type of cancer. In the first case, data from multiple sources such as different platforms, labs, or preprocessing methods are used to study variation in copy number in the same individual. Combining these sources provides a higher resolution, which leads to a more detailed genome-wide survey of the individual. In this case, we provide a simple statistical framework to derive a consensus molecular signature. In the framework, the multiple sequences from various sources are integrated into a single sequence, and then the proposed segmentation procedure is applied to this sequence to detect aberrant regions. In the second case, cohort analysis of multiple patients is carried out to derive overall molecular signatures for the cohort. For this case, we provide another simple statistical framework in which data across multiple profiles is standardized before segmentation. The proposed segmentation procedure is then applied to the standardized profiles one at a time to detect aberrant regions. Any such regions that are common across two or more profiles are probably real and may play important roles in the cancer pathogenesis process.
Conclusions
The main advantages of the proposed procedure are flexibility and simplicity.
doi:10.1186/1471-2105-13-277
PMCID: PMC3571941  PMID: 23107320
Bayesian information criterion; Circular binary segmentation; Consensus molecular signature; Overall molecular signature; Variation in DNA copy number
13.  A model of yeast cell-cycle regulation based on multisite phosphorylation 
Multisite phosphorylation of CDK target proteins provides the requisite nonlinearity for cell cycle modeling using elementary reaction mechanisms. Stochastic simulations, based on Gillespie's algorithm and using realistic numbers of protein and mRNA molecules, compare favorably with single-cell measurements in budding yeast. The role of transcription–translation coupling is critical in the robust operation of protein regulatory networks in yeast cells.
Progression through the eukaryotic cell cycle is governed by the activation and inactivation of a family of cyclin-dependent kinases (CDKs) and auxiliary proteins that regulate CDK activities (Morgan, 2007). The many components of this protein regulatory network are interconnected by positive and negative feedback loops that create bistable switches and transient pulses (Tyson and Novak, 2008). The network must ensure that cell-cycle events proceed in the correct order, that cell division is balanced with respect to cell growth, and that any problems encountered (in replicating the genome or partitioning chromosomes to daughter cells) are corrected before the cell proceeds to the next phase of the cycle. The network must operate robustly in the context of unavoidable molecular fluctuations in a yeast-sized cell. With a volume of only 5×10⁻¹⁴ l, a yeast cell contains one copy of the gene for each component of the network, a handful of mRNA transcripts of each gene, and a few hundreds to thousands of protein molecules carrying out each gene's function. How large are the molecular fluctuations implied by these numbers, and what effects do they have on the functioning of the cell-cycle control system?
To answer these questions, we have built a new model (Figure 1) of the CDK regulatory network in budding yeast, based on the fact that the targets of CDK activity are typically phosphorylated on multiple sites. The activity of each target protein depends on how many sites are phosphorylated. The target proteins feed back on CDK activity by controlling cyclin synthesis (SBF's role) and degradation (Cdh1's role) and by releasing a CDK-counteracting phosphatase (Cdc14). Every reaction in Figure 1 can be described by a mass-action rate law, with an accompanying rate constant that must be estimated from experimental data. As the transcription and translation of mRNA molecules have major effects on fluctuating numbers of protein molecules (Pedraza and Paulsson, 2008), we have included mRNA transcripts for each protein in the model.
To create a deterministic model, the rate laws are combined, according to standard principles of chemical kinetics, into a set of 60 differential equations that govern the temporal dynamics of the control system. In the stochastic version of the model, the rate law for each reaction determines the probability per unit time that a particular reaction occurs, and we use Gillespie's stochastic simulation algorithm (Gillespie, 1976) to compute possible temporal sequences of reaction events. Accurate stochastic simulations require knowledge of the expected numbers of mRNA and protein molecules in a single yeast cell. Fortunately, these numbers are available from several sources (Ghaemmaghami et al, 2003; Zenklusen et al, 2008). Although the experimental estimates are not always in good agreement with each other, they are sufficiently reliable to populate a stochastic model with realistic numbers of molecules.
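Gillespie's stochastic simulation algorithm can be sketched on the simplest possible system, a one-stage birth–death model of a single mRNA species. This is not the 60-equation cell-cycle model. The rate constants are assumptions chosen so that the mean copy number is 8 and the half-life is 1 time unit, and the function names are hypothetical.

```python
import math
import random

def gillespie_birth_death(k_syn, k_deg, n0, t_end, rng):
    """Simulate synthesis (rate k_syn) and decay (rate k_deg * n)."""
    t, n = 0.0, n0
    times, counts = [t], [n]
    while True:
        a_syn = k_syn          # propensity of synthesis
        a_deg = k_deg * n      # propensity of degradation
        a_total = a_syn + a_deg
        if a_total == 0.0:
            break
        # Waiting time to the next reaction is exponential with rate a_total.
        t += rng.expovariate(a_total)
        if t > t_end:
            break
        # Pick the reaction with probability proportional to its propensity.
        if rng.random() * a_total < a_syn:
            n += 1
        else:
            n -= 1
        times.append(t)
        counts.append(n)
    return times, counts

def time_average(times, counts, t_end):
    """Time-weighted mean copy number over [0, t_end]."""
    total = 0.0
    for i, n in enumerate(counts):
        t_next = times[i + 1] if i + 1 < len(times) else t_end
        total += n * (t_next - times[i])
    return total / t_end

k_deg = math.log(2.0)   # assumed half-life of 1 time unit
k_syn = 8.0 * k_deg     # assumed, so that the steady-state mean is 8
times, counts = gillespie_birth_death(k_syn, k_deg, 8, 2000.0, random.Random(1))
print(time_average(times, counts, 2000.0))
```

Each pass through the loop draws one waiting time and one reaction choice, so the trajectory is a possible temporal sequence of reaction events rather than a deterministic solution.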
By simulating thousands of cells (as in Figure 5), we can build up representative samples for computing the mean and s.d. of any measurable cell-cycle property (e.g. interdivision time, size at division, duration of G1 phase). The excellent fit of simulated statistics to observations of cell-cycle variability is documented in the main text and Supplementary Information.
Of particular interest to us are observations of Di Talia et al (2007) of the timing of a crucial G1 event (export of Whi5 protein from the nucleus) in a population of budding yeast cells growing at a specific growth rate α=ln2/(mass-doubling time). Whi5 export is a consequence of Whi5 phosphorylation, and it occurs simultaneously with the release (activation) of SBF (see Figure 1). Using fluorescently labeled Whi5, Di Talia et al could easily measure (in individual yeast cells) the time, T1, from cell birth to the abrupt loss of Whi5 from the nucleus. Correlating T1 to the size of the cell at birth, Vbirth, they found that, for a sample of daughter cells, αT1 versus ln(Vbirth) could be fit with two straight lines of slope −0.7 and −0.3. Our simulation of this experiment (Figure 7 of the main text) compares favorably with Figure 3d and e in Di Talia et al (2007).
The major sources of noise in our model (and in protein regulatory networks in yeast cells, in general) are related to gene transcription and the small number of unique mRNA transcripts. As each mRNA molecule may instruct the synthesis of dozens of protein molecules, the coefficient of variation of molecular fluctuations at the protein level (CVP) may be dominated by fluctuations at the mRNA level, as expressed in the formula (Pedraza and Paulsson, 2008) where NM, NP denote the number of mRNA and protein molecules, respectively, and ρ=τM/τP is the ratio of half-lives of mRNA and protein molecules. For a yeast cell, typical values of NM and NP are 8 and 800, respectively (Ghaemmaghami et al, 2003; Zenklusen et al, 2008). If ρ=1, then CVP≈25%. Such large fluctuations in protein levels are inconsistent with the observed variability of size and age at division in yeast cells, as shown in the simplified cell-cycle model of Kar et al (2009) and as we have confirmed with our more realistic model. The size of these fluctuations can be reduced to a more acceptable level by assuming a shorter half-life for mRNA (say, ρ=0.1).
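The formula cited in the preceding paragraph is not reproduced in this listing. A commonly quoted two-stage expression with the same symbols is CVP² = 1/NP + (1/NM)·ρ/(1+ρ), and it reproduces the 25% figure quoted for ρ=1. The sketch below assumes that expression; it is offered as a consistency check on the quoted numbers, not as the authors' exact equation, and the function name is hypothetical.

```python
import math

def protein_cv(n_mrna, n_protein, rho):
    """Assumed form: CV_P^2 = 1/N_P + (1/N_M) * rho / (1 + rho).

    rho is the ratio of mRNA to protein half-lives, as defined in the text.
    """
    return math.sqrt(1.0 / n_protein + (1.0 / n_mrna) * rho / (1.0 + rho))

# Typical values quoted in the text: N_M = 8, N_P = 800.
for rho in (1.0, 0.1):
    print(rho, round(100.0 * protein_cv(8, 800, rho), 1), "%")
```

Under this assumed form the mRNA term dominates when ρ=1, and shortening the mRNA half-life to ρ=0.1 shrinks that term by roughly a factor of five, which is the direction of the effect the paragraph describes.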
There must be some mechanisms whereby yeast cells lessen the protein fluctuations implied by transcription–translation coupling. Following Pedraza and Paulsson (2008), we suggest that mRNA gestation and senescence may resolve this problem. Equation (3) is based on a simple, one-stage, birth–death model of mRNA turnover. In Supplementary Appendix 1, we show that a model of mRNA processing, with 10 stages each of mRNA gestation and senescence, gives reasonable fluctuations at the protein level (CVP≈5%), even if the effective half-life of mRNA is 10 min. A one-stage model with τM=1 min gives comparable fluctuations (CVP≈5%). In the main text, we use a simple birth–death model of mRNA turnover with an ‘effective' half-life of 1 min, in order to limit the computational complexity of the full cell-cycle model.
In order for the cell's genome to be passed intact from one generation to the next, the events of the cell cycle (DNA replication, mitosis, cell division) must be executed in the correct order, despite the considerable molecular noise inherent in any protein-based regulatory system residing in the small confines of a eukaryotic cell. To assess the effects of molecular fluctuations on cell-cycle progression in budding yeast cells, we have constructed a new model of the regulation of Cln- and Clb-dependent kinases, based on multisite phosphorylation of their target proteins and on positive and negative feedback loops involving the kinases themselves. To account for the significant role of noise in the transcription and translation steps of gene expression, the model includes mRNAs as well as proteins. The model equations are simulated deterministically and stochastically to reveal the bistable switching behavior on which proper cell-cycle progression depends and to show that this behavior is robust to the level of molecular noise expected in yeast-sized cells (∼50 fL volume). The model gives a quantitatively accurate account of the variability observed in the G1-S transition in budding yeast, which is governed by an underlying sizer+timer control system.
doi:10.1038/msb.2010.55
PMCID: PMC2947364  PMID: 20739927
bistability; cell-cycle variability; size control; stochastic model; transcription–translation coupling
14.  Inferring changepoint times of medial temporal lobe morphometric change in preclinical Alzheimer's disease 
NeuroImage : Clinical  2014;5:178-187.
This paper uses diffeomorphometry methods to quantify the order in which statistically significant morphometric change occurs in three medial temporal lobe regions, the amygdala, entorhinal cortex (ERC), and hippocampus among subjects with symptomatic and preclinical Alzheimer's disease (AD). Magnetic resonance imaging scans were examined in subjects who were cognitively normal at baseline, some of whom subsequently developed clinical symptoms of AD. The images were mapped to a common template, using shape-based diffeomorphometry. The multidimensional shape markers indexed through the temporal lobe structures were modeled using a changepoint model with explicit parameters, specifying the number of years preceding clinical symptom onset. Our model assumes that the atrophy rate of a considered brain structure increases years before detectable symptoms.
The results demonstrate that the atrophy changepoint in the ERC occurs first, indicating significant change 8–10 years prior to onset, followed by the hippocampus, 2–4 years prior to onset, followed by the amygdala, 3 years prior to onset. The ERC is significant bilaterally, in both our local and global measures, with estimates of ERC surface area loss of 2.4% (left side) and 1.6% (right side) annually. The same changepoint model for ERC volume gives 3.0% and 2.7% on the left and right sides, respectively. Understanding the order in which changes in the brain occur during preclinical AD may assist in the design of intervention trials aimed at slowing the evolution of the disease.
Highlights
• We use diffeomorphometry to quantify the order in which statistically significant morphometric change occurs in three medial temporal lobe regions, the amygdala, entorhinal cortex (ERC), and hippocampus among subjects with symptomatic and preclinical Alzheimer's disease (AD).
• We introduce a model on anatomical shape change in which changepoint is inferred, taking place some period of time before cognitive onset of AD.
• The analysis uses a dataset arising from the BIOCARD study, in which all subjects were cognitively normal at baseline, some of whom subsequently developed clinical symptoms of AD.
• The results demonstrate that the atrophy changepoint in the ERC occurs first, indicating significant change 8-10 years prior to onset, followed by hippocampus, 2-4 years prior to onset, followed by amygdala, 3 years prior to onset.
• The ERC is significant bilaterally, in both our local and global measures, with estimates of ERC surface area loss of 2.4% (left side) and 1.6% (right side) annually.
• Understanding the order in which changes in the brain occur during preclinical AD may assist in the design of intervention trials aimed at slowing the evolution of the disease.
doi:10.1016/j.nicl.2014.04.009
PMCID: PMC4110355  PMID: 25101236
AD, Alzheimer's disease; MCI, mild cognitive impairment; ERC, entorhinal cortex; NIH, Clinical Center of the National Institutes of Health; NIA, National Institute on Aging; NIMH, National Institute for Mental Health; GPB, Geriatric Psychiatry Branch; SPGR, spoiled gradient echo; CDR, clinical dementia rating; FWER, family-wise error rate; ROI-LDDMM, region-of-interest large deformation diffeomorphic metric mapping; RSS, residual sum of squares; MMSE, mini-mental state exam; diffeomorphometry, study of shape using a metric on the diffeomorphic connections between structures
15.  Human metabolic profiles are stably controlled by genetic and environmental variation 
A comprehensive variation map of the human metabolome identifies genetic and stable-environmental sources as major drivers of metabolite concentrations. The data suggest that sample sizes of a few thousand are sufficient to detect metabolite biomarkers predictive of disease.
We designed a longitudinal twin study to characterize the genetic, stable-environmental, and longitudinally fluctuating influences on metabolite concentrations in two human biofluids—urine and plasma—focusing specifically on the representative subset of metabolites detectable by 1H nuclear magnetic resonance (1H NMR) spectroscopy. We identified widespread genetic and stable-environmental influences on the (urine and plasma) metabolomes, with (30 and 42%) attributable on average to familial sources, and (47 and 60%) attributable to longitudinally stable sources. Ten of the metabolites annotated in the study are estimated to have >60% familial contribution to their variation in concentration. Our findings have implications for the design and interpretation of 1H NMR-based molecular epidemiology studies. On the basis of the stable component of variation quantified in the current paper, we specified a model of disease association under which we inferred that sample sizes of a few thousand should be sufficient to detect disease-predictive metabolite biomarkers.
Metabolites are small molecules involved in biochemical processes in living systems. Their concentration in biofluids, such as urine and plasma, can offer insights into the functional status of biological pathways within an organism, and reflect input from multiple levels of biological organization—genetic, epigenetic, transcriptomic, and proteomic—as well as from environmental and lifestyle factors. Metabolite levels have the potential to indicate a broad variety of deviations from the ‘normal' physiological state, such as those that accompany a disease, or an increased susceptibility to disease. A number of recent studies have demonstrated that metabolite concentrations can be used to diagnose disease states accurately. A more ambitious goal is to identify metabolite biomarkers that are predictive of future disease onset, providing the possibility of intervention in susceptible individuals.
If an extreme concentration of a metabolite is to serve as an indicator of disease status, it is usually important to know the distribution of metabolite levels among healthy individuals. It is also useful to characterize the sources of that observed variation in the healthy population. A proportion of that variation—the heritable component—is attributable to genetic differences between individuals, potentially at many genetic loci. An effective, molecular indicator of a heritable, complex disease is likely to have a substantive heritable component. Non-heritable biological variation in metabolite concentrations can arise from a variety of environmental influences, such as dietary intake, lifestyle choices, general physical condition, composition of gut microflora, and use of medication. Variation across a population in stable-environmental influences leads to long-term differences between individuals in their baseline metabolite levels. Dynamic environmental pressures lead to short-term fluctuations within an individual about their baseline level. A metabolite whose concentration changes substantially in response to short-term pressures is relatively unlikely to offer long-term prediction of disease. In summary, the potential suitability of a metabolite to predict disease is reflected by the relative contributions of heritable and stable/unstable-environmental factors to its variation in concentration across the healthy population.
Studies involving twins are an established technique for quantifying the heritable component of phenotypes in human populations. Monozygotic (MZ) twins share the same DNA genome-wide, while dizygotic (DZ) twins share approximately half their inherited DNA, as do ordinary siblings. By comparing the average extent of phenotypic concordance within MZ pairs to that within DZ pairs, it is possible to quantify the heritability of a trait, and also to quantify the familiality, which refers to the combination of heritable and common-environmental effects (i.e., environmental influences shared by twins in a pair). In addition to incorporating twins into the study design, it is useful to quantify the phenotype in some individuals at multiple time points. The longitudinal aspect of such a study allows environmental effects to be decomposed into those that affect the phenotype over the short term and those that exert stable influence.
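The MZ versus DZ comparison described above can be made concrete with the classical Falconer estimates, which assume an additive genetic component shared fully by MZ twins and half by DZ twins, plus a common-environmental component shared equally by both. This is a textbook sketch, not the bespoke statistical method used in the study, and the correlations in the example are invented for illustration.

```python
def falconer_estimates(r_mz, r_dz):
    """Classical twin-study estimates from within-pair correlations.

    Assumes r_mz = h2 + c2 and r_dz = h2 / 2 + c2.
    """
    h2 = 2.0 * (r_mz - r_dz)    # heritability
    c2 = 2.0 * r_dz - r_mz      # common-environmental share
    familiality = h2 + c2       # equals r_mz under this model
    return {"heritability": h2, "common_env": c2, "familiality": familiality}

# Hypothetical within-pair correlations for one metabolite peak.
print(falconer_estimates(r_mz=0.6, r_dz=0.4))
```

Under these assumptions familiality reduces to the MZ within-pair correlation, which is why it can be estimated even when heritability and common environment are hard to separate.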
For the current study, urine and blood samples were collected from a cohort of MZ and DZ twins, with some twins donating samples on two occasions several months apart. Samples were analysed by 1H nuclear magnetic resonance (1H NMR) spectroscopy—an untargeted, discovery-driven technique for quantifying metabolite concentrations in biological samples. The application of 1H NMR to a biological sample creates a spectrum, made up of multiple peaks, with each peak's size quantitatively representing the concentration of its corresponding hydrogen-containing metabolite.
In each biological sample in our study, we extracted a full set of peaks, and thereby quantified the concentrations of all common plasma and urine metabolites detectable by 1H NMR. We developed bespoke statistical methods to decompose the observed concentration variation at each metabolite peak into that originating from familial, individual-environmental, and unstable-environmental sources.
We quantified the variability landscape across all common metabolite peaks in the urine and plasma 1H NMR metabolomes. We annotated a subset of peaks with a total of 65 metabolites; the variance decompositions for these are shown in Figure 1. Ten metabolites' concentrations were estimated to have familial contributions in excess of 60%. The average proportion of stable variation across all extracted metabolite peaks was estimated to be 47% in the urine samples and 60% in the plasma samples; the average estimated familiality was 30% for urine and 42% for plasma. These results comprise the first quantitative variation map of the 1H NMR metabolome. The identification and quantification of substantive widespread stability provides support for the use of these biofluids in molecular epidemiology studies. On the basis of our findings, we performed power calculations for a hypothetical study searching for predictive disease biomarkers among 1H NMR-detectable urine and plasma metabolites. Our calculations suggest that sample sizes of 2000–5000 should allow reliable identification of disease-predictive metabolite concentrations explaining 5–10% of disease risk, while greater sample sizes of 5000–20 000 would be required to identify metabolite concentrations explaining 1–2% of disease risk.
1H Nuclear Magnetic Resonance spectroscopy (1H NMR) is increasingly used to measure metabolite concentrations in sets of biological samples for top-down systems biology and molecular epidemiology. For such purposes, knowledge of the sources of human variation in metabolite concentrations is valuable, but currently sparse. We conducted and analysed a study to create such a resource. In our unique design, identical and non-identical twin pairs donated plasma and urine samples longitudinally. We acquired 1H NMR spectra on the samples, and statistically decomposed variation in metabolite concentration into familial (genetic and common-environmental), individual-environmental, and longitudinally unstable components. We estimate that stable variation, comprising familial and individual-environmental factors, accounts on average for 60% (plasma) and 47% (urine) of biological variation in 1H NMR-detectable metabolite concentrations. Clinically predictive metabolic variation is likely nested within this stable component, so our results have implications for the effective design of biomarker-discovery studies. We provide a power-calculation method which reveals that sample sizes of a few thousand should offer sufficient statistical precision to detect 1H NMR-based biomarkers quantifying predisposition to disease.
doi:10.1038/msb.2011.57
PMCID: PMC3202796  PMID: 21878913
biomarker; 1H nuclear magnetic resonance spectroscopy; metabolome-wide association study; top-down systems biology; variance decomposition
16.  CGHweb: a tool for comparing DNA copy number segmentations from multiple algorithms 
Bioinformatics (Oxford, England)  2008;24(7):1014-1015.
Summary
Accurate estimation of DNA copy numbers from array comparative genomic hybridization (CGH) data is important for characterizing the cancer genome. An important part of this process is the segmentation of the log-ratios between the sample and control DNA along the chromosome into regions of different copy numbers. However, multiple algorithms are available in the literature for this procedure and the results can vary substantially among these. Thus, a visualization tool that can display the segmented profiles from a number of methods can be helpful to the biologist or the clinician to ascertain that a feature of interest did not arise as an artifact of the algorithm. Such a tool also allows the methodologist to easily contrast his method against others.
We developed a web-based tool that applies a number of popular algorithms to a single array CGH profile entered by the user. It generates a heatmap panel of the segmented profiles for each method as well as a consensus profile. The clickable heatmap can be moved along the chromosome and zoomed in or out. It also displays the time that each algorithm took and provides numerical values of the segmented profiles for download. The web interface calls algorithms written in the statistical language R. We encourage developers of new algorithms to submit their routines to be incorporated into the website.
Availability: http://compbio.med.harvard.edu/CGHweb
Contact: peter_park@harvard.edu
doi:10.1093/bioinformatics/btn067
PMCID: PMC2516369  PMID: 18296463
17.  Minority HIV-1 Drug Resistance Mutations Are Present in Antiretroviral Treatment–Naïve Populations and Associate with Reduced Treatment Efficacy 
PLoS Medicine  2008;5(7):e158.
Background
Transmitted HIV-1 drug resistance can compromise initial antiretroviral therapy (ART); therefore, its detection is important for patient management. The absence of drug-associated selection pressure in treatment-naïve persons can cause drug-resistant viruses to decline to levels undetectable by conventional bulk sequencing (minority drug-resistant variants). We used sensitive and simple tests to investigate evidence of transmitted drug resistance in antiretroviral drug-naïve persons and assess the clinical implications of minority drug-resistant variants.
Methods and Findings
We performed a cross-sectional analysis of transmitted HIV-1 drug resistance and a case-control study of the impact of minority drug resistance on treatment response. For the cross-sectional analysis, we examined viral RNA from newly diagnosed ART-naïve persons in the US and Canada who had no detectable (wild type, n = 205) or one or more resistance-related mutations (n = 303) by conventional sequencing. Eight validated real-time PCR-based assays were used to test for minority drug resistance mutations (protease L90M and reverse transcriptase M41L, K70R, K103N, Y181C, M184V, and T215F/Y) above naturally occurring frequencies. The sensitive real-time PCR testing identified one to three minority drug resistance mutation(s) in 34/205 (17%) newly diagnosed persons who had wild-type virus by conventional genotyping; four (2%) individuals had mutations associated with resistance to two drug classes. Among 30/303 (10%) samples with bulk genotype resistance mutations we found at least one minority variant with a different drug resistance mutation. For the case-control study, we assessed the impact of three treatment-relevant drug resistance mutations at baseline from a separate group of 316 previously ART-naïve persons with no evidence of drug resistance on bulk genotype testing who were placed on efavirenz-based regimens. We found that 7/95 (7%) persons who experienced virologic failure had minority drug resistance mutations at baseline; however, minority resistance was found in only 2/221 (0.9%) treatment successes (Fisher exact test, p = 0.0038).
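The case-control comparison (7 of 95 failures versus 2 of 221 successes with minority resistance) can be rechecked with a two-sided Fisher exact test. The sketch below is a standard-library implementation written for illustration; the function name is hypothetical, and the two-sided rule used (sum the probabilities of all tables no more likely than the observed one) is one common convention that may differ from the software the authors used.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    lo = max(0, col1 - row2)
    hi = min(col1, row1)

    def weight(k):
        # Unnormalized hypergeometric probability of k in the top-left cell.
        return comb(row1, k) * comb(row2, col1 - k)

    observed = weight(a)
    extreme = sum(weight(k) for k in range(lo, hi + 1) if weight(k) <= observed)
    return extreme / comb(row1 + row2, col1)

# Rows: virologic failure, treatment success.
# Columns: minority resistance present, absent.
p = fisher_exact_two_sided(7, 95 - 7, 2, 221 - 2)
print(p)
```

The weights are kept as exact integers until the final division, so the comparison against the observed table is not affected by floating-point rounding.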
Conclusions
These data suggest that a considerable proportion of transmitted HIV-1 drug resistance is undetected by conventional genotyping and that minority mutations can have clinical consequences. With no treatment history to help guide therapies for drug-naïve persons, the findings suggest an important role for sensitive baseline drug resistance testing.
Using real-time PCR to detect HIV resistance mutations present at low levels, Jeffrey Johnson and colleagues investigate prevalence and clinical implications of minority transmitted mutations.
Editors' Summary
Background
Since the mid-1990s, several powerful antiretroviral drug combinations have been developed that have greatly improved the prognosis of HIV infection. All antiretroviral therapy (ART) regimens combine drugs that act against HIV in different ways (so-called different drug classes). Multiple drugs are necessary because HIV continually accumulates random changes (mutations) in its genetic material (genome). Some of these mutations make HIV resistant to individual antiretroviral drugs, so a mixture of drugs is needed to keep the virus in check. However, the efficacy of ART (which itself selects for drug-resistant variants by giving them a growth advantage over drug-sensitive variants) is substantially reduced when these variants account for more than about 20% of the viruses in an infected person. This level of variant virus can be detected in blood samples with a technique called bulk sequencing. In North America and Europe, where ART has been widely used for many years, around 20% of HIV-infected people who have taken ART themselves develop this level of drug-resistant virus, which can be transmitted by the same routes as nonresistant HIV (typically unprotected sexual intercourse or needle sharing). In such cases, the person acquiring drug-resistant HIV may experience treatment failure when drugs later fail to work against the resistant virus. In these countries, therefore, resistance testing by bulk sequencing is done routinely before ART is initiated to decide which antiviral drugs are likely to be effective.
Why Was This Study Done?
Several years usually elapse between the time a person becomes infected with HIV and the time he or she starts ART. During this time, the absence of selection pressure from antiviral drugs means that transmitted drug-resistant variants tend to decline to levels undetectable by bulk sequencing. These “minority drug-resistant variants” can be detected using other more sensitive tests but it is not known what proportion of HIV-infected people who have never taken ART carry minority drug-resistant variants (the “prevalence” of these variants). It is also unknown whether the presence of minority drug-resistant variants reduces the success of ART. In this paper, the researchers first report a “cross-sectional” study in North America using a sensitive assay to determine the prevalence of minority drug-resistant viruses among HIV-infected people who had never received ART. They then investigate whether minority drug-resistant variants have any impact on the effectiveness of ART in a “case-control” study.
What Did the Researchers Do and Find?
In their cross-sectional study, the researchers used a highly sensitive test for detecting mutations (called a real-time PCR-based assay) to look for low levels of viruses carrying any of eight major drug-resistance mutations in people with newly diagnosed HIV infection who reported no prior treatment with ART. Seventeen percent of the people who had only wild-type (nonmutated) virus by bulk sequencing (205 participants) were found, in fact, to carry low levels of virus variants with 1–3 drug-resistance mutations; 2% of them carried viruses resistant to two different drug classes (called multi-drug resistance). Among the people with resistance mutations detected by bulk sequencing (303 participants), 10% had at least one additional minority drug-resistant variant, often a viral variant that was resistant to a drug class different from that detected by bulk sequencing. In the case-control study, the researchers used their sensitive assays to measure the levels of viruses containing any of the three most common drug resistance mutations likely to affect viral responses to the antiretroviral drugs efavirenz and lamivudine in 316 people just before they started their first HIV treatment, which included these drugs. Of people for whom ART failed, 7% were infected with minority drug-resistant virus variants at baseline compared with only 0.9% of people for whom ART worked; this difference was statistically significant.
What Do These Findings Mean?
The findings of the cross-sectional study indicate that conventional bulk sequencing fails to detect a large proportion of transmitted HIV drug resistance and suggest that the transmission of drug-resistant variants from infectious ART-experienced people to ART-naïve individuals might not be uncommon. The findings of the case-control study suggest that the minority drug-resistant HIV variants may have clinical consequences. That is, the presence of such variants in individuals who have not previously taken ART may reduce the efficacy of some ART regimens. However, the number of participants meeting the criteria for analysis in the cross-sectional study was limited, and the association between minority resistance and treatment failure may have been influenced by other factors. Taken together, these findings suggest that, to ensure that first-line ART is as effective as possible, greater efforts should be made to prevent HIV transmission, whether from ART-experienced or ART-naïve people. However, because data on minority drug-resistant virus are limited, more studies, particularly with recent populations, are needed before testing for these variants can be considered appropriate in the clinical management of newly diagnosed HIV infection.
Additional Information.
Please access these Web sites via the online version of this summary at http://dx.doi.org/10.1371/journal.pmed.0050158.
This study is further discussed in a PLoS Medicine Perspective by Steven G. Deeks
Information is available from the US National Institute of Allergy and Infectious Diseases on HIV infection and AIDS
HIV InSite has comprehensive information on all aspects of HIV/AIDS, including links to fact sheets (in English, French, and Spanish) about antiretrovirals and information on genetic testing for HIV drug resistance
NAM, a UK registered charity, provides information about all aspects of HIV and AIDS, including fact sheets on types of HIV drug, drug resistance, and resistance tests (in English, Spanish, French, Portuguese, and Russian)
The US Centers for Disease Control and Prevention provides information on HIV/AIDS and on treatment (in English and Spanish)
doi:10.1371/journal.pmed.0050158
PMCID: PMC2488194  PMID: 18666824
18.  Polymorphisms, Mutations, and Amplification of the EGFR Gene in Non-Small Cell Lung Cancers 
PLoS Medicine  2007;4(4):e125.
Background
The epidermal growth factor receptor (EGFR) gene is the prototype member of the type I receptor tyrosine kinase (TK) family and plays a pivotal role in cell proliferation and differentiation. There are three well described polymorphisms that are associated with increased protein production in experimental systems: a polymorphic dinucleotide repeat (CA simple sequence repeat 1 [CA-SSR1]) in intron one (lower number of repeats) and two single nucleotide polymorphisms (SNPs) in the promoter region, −216 (G/T or T/T) and −191 (C/A or A/A). The objective of this study was to examine distributions of these three polymorphisms and their relationships to each other and to EGFR gene mutations and allelic imbalance (AI) in non-small cell lung cancers.
Methods and Findings
We examined the frequencies of the three polymorphisms of EGFR in 556 resected lung cancers and corresponding non-malignant lung tissues from 336 East Asians, 213 individuals of Northern European descent, and seven of other ethnicities. We also studied the EGFR gene in 93 corresponding non-malignant lung tissue samples from European-descent patients from Italy and in peripheral blood mononuclear cells from 250 normal healthy US individuals enrolled in epidemiological studies including individuals of European descent, African–Americans, and Mexican–Americans. We sequenced the four exons (18–21) of the TK domain known to harbor activating mutations in tumors and examined the status of the CA-SSR1 alleles (presence of heterozygosity, repeat number of the alleles, and relative amplification of one allele) and allele-specific amplification of mutant tumors as determined by a standardized semiautomated method of microsatellite analysis. Variant forms of SNP −216 (G/T or T/T) and SNP −191 (C/A or A/A) (associated with higher protein production in experimental systems) were less frequent in East Asians than in individuals of other ethnicities (p < 0.001). Both alleles of CA-SSR1 were significantly longer in East Asians than in individuals of other ethnicities (p < 0.001). Expression studies using bronchial epithelial cultures demonstrated a trend towards increased mRNA expression in cultures having the variant SNP −216 G/T or T/T genotypes. Monoallelic amplification of the CA-SSR1 locus was present in 30.6% of the informative cases and occurred more often in individuals of East Asian ethnicity. AI was present in 44.4% (95% confidence interval: 34.1%–54.7%) of mutant tumors compared with 25.9% (20.6%–31.2%) of wild-type tumors (p = 0.002). The shorter allele in tumors with AI in East Asian individuals was selectively amplified (shorter allele dominant) more often in mutant tumors (75.0%, 61.6%–88.4%) than in wild-type tumors (43.5%, 31.8%–55.2%, p = 0.003). 
In addition, there was a strong positive association between AI ratios of CA-SSR1 alleles and AI of mutant alleles.
Conclusions
The three polymorphisms associated with increased EGFR protein production (shorter CA-SSR1 length and variant forms of SNPs −216 and −191) were found to be rare in East Asians as compared to other ethnicities, suggesting that the cells of East Asians may make relatively less intrinsic EGFR protein. Interestingly, especially in tumors from patients of East Asian ethnicity, EGFR mutations were found to favor the shorter allele of CA-SSR1, and selective amplification of the shorter allele of CA-SSR1 occurred frequently in tumors harboring a mutation. These distinct molecular events targeting the same allele would both be predicted to result in greater EGFR protein production and/or activity. Our findings may help to explain some of the ethnic differences observed in mutational frequencies and responses to TK inhibitors.
Masaharu Nomura and colleagues examine the distribution of EGFR polymorphisms in different populations and find differences that might explain different responses to tyrosine kinase inhibitors in lung cancer patients.
Editors' Summary
Background.
Most cases of lung cancer, the leading cause of cancer deaths worldwide, are “non-small cell lung cancer” (NSCLC), which has a very low cure rate. Recently, however, “targeted” therapies have brought new hope to patients with NSCLC. Like all cancers, NSCLC occurs when cells begin to divide uncontrollably because of changes (mutations) in their genetic material. Chemotherapy drugs treat cancer by killing these rapidly dividing cells, but, because some normal tissues are sensitive to these agents, it is hard to kill the cancer completely without causing serious side effects. Targeted therapies specifically attack the changes in cancer cells that allow them to divide uncontrollably, so it might be possible to kill the cancer cells selectively without damaging normal tissues. Epidermal growth factor receptor (EGFR) was one of the first molecules for which a targeted therapy was developed. In normal cells, messenger proteins bind to EGFR and activate its “tyrosine kinase,” an enzyme that sticks phosphate groups on tyrosine (an amino acid) in other proteins. These proteins then tell the cell to divide. Alterations to this signaling system drive the uncontrolled growth of some cancers, including NSCLC.
Why Was This Study Done?
Molecules that inhibit the tyrosine kinase activity of EGFR (for example, gefitinib) dramatically shrink some NSCLCs, particularly those in East Asian patients. Tumors shrunk by tyrosine kinase inhibitors (TKIs) often (but not always) have mutations in EGFR's tyrosine kinase. However, not all tumors with these mutations respond to TKIs, and other genetic changes—for example, amplification (multiple copies) of the EGFR gene—also affect tumor responses to TKIs. It would be useful to know which genetic changes predict these responses when planning treatments for NSCLC and to understand why the frequency of these changes varies between ethnic groups. In this study, the researchers have examined three polymorphisms—differences in DNA sequences that occur between individuals—in the EGFR gene in people with and without NSCLC. In addition, they have looked for associations between these polymorphisms, which are present in every cell of the body, and the EGFR gene mutations and allelic imbalances (genes occur in pairs but amplification or loss of one copy, or allele, often causes allelic imbalance in tumors) that occur in NSCLCs.
What Did the Researchers Do and Find?
The researchers measured how often three EGFR polymorphisms (the length of a repeat sequence called CA-SSR1, and two single nucleotide variations [SNPs])—all of which probably affect how much protein is made from the EGFR gene—occurred in normal tissue and NSCLC tissue from East Asians and individuals of European descent. They also looked for mutations in the EGFR tyrosine kinase and allelic imbalance in the tumors, and then determined which genetic variations and alterations tended to occur together in people with the same ethnicity. Among many associations, the researchers found that shorter alleles of CA-SSR1 and the minor forms of the two SNPs occurred less often in East Asians than in individuals of European descent. They also confirmed that EGFR kinase mutations were more common in NSCLCs in East Asians than in European-descent individuals. Furthermore, mutations occurred more often in tumors with allelic imbalance, and in tumors where there was allelic imbalance and an EGFR mutation, the mutant allele was amplified more often than the wild-type allele.
What Do These Findings Mean?
The researchers use these associations between gene variants and tumor-associated alterations to propose a model to explain the ethnic differences in mutational frequencies and responses to TKIs seen in NSCLC. They suggest that because of the polymorphisms in the EGFR gene commonly seen in East Asians, people from this ethnic group make less EGFR protein than people from other ethnic groups. This would explain why, if a threshold level of EGFR is needed to drive cells towards malignancy, East Asians have a high frequency of amplified EGFR tyrosine kinase mutations in their tumors—mutation followed by amplification would be needed to activate EGFR signaling. This model, though speculative, helps to explain some clinical findings, such as the frequency of EGFR mutations and of TKI sensitivity in NSCLCs in East Asians. Further studies of this type in different ethnic groups and in different tumors, as well as with other genes for which targeted therapies are available, should help oncologists provide personalized cancer therapies for their patients.
Additional Information.
Please access these Web sites via the online version of this summary at http://dx.doi.org/10.1371/journal.pmed.0040125.
US National Cancer Institute information on lung cancer and on cancer treatment for patients and professionals
MedlinePlus encyclopedia entries on NSCLC
Cancer Research UK information for patients about all aspects of lung cancer, including treatment with TKIs
Wikipedia pages on lung cancer, EGFR, and gefitinib (note that Wikipedia is a free online encyclopedia that anyone can edit)
doi:10.1371/journal.pmed.0040125
PMCID: PMC1876407  PMID: 17455987
19.  Category-Specific Comparison of Univariate Alerting Methods for Biosurveillance Decision Support 
Objective
For a multi-source decision support application, we sought to match univariate alerting algorithms to surveillance data types to optimize detection performance.
Introduction
Temporal alerting algorithms commonly used in syndromic surveillance systems are often adjusted for data features such as cyclic behavior but are subject to overfitting or misspecification errors when applied indiscriminately.
In a project for the Armed Forces Health Surveillance Center to enable multivariate decision support, we obtained 4.5 years of out-patient, prescription and laboratory test records from all US military treatment facilities. A proof-of-concept project phase produced 16 events with multiple evidence corroboration for comparison of alerting algorithms for detection performance.
We used the representative streams from each data source to compare sensitivity of 6 algorithms to injected spikes, and we used all data streams from 16 known events to compare them for detection timeliness.
Methods
The six methods compared were: Holt-Winters generalized exponential smoothing method (1); automated choice between daily methods, regression and an exponential weighted moving average (2); adaptive daily Shewhart-type chart; adaptive one-sided daily CUSUM; EWMA applied to 7-day means with a trend correction; and 7-day temporal scan statistic.
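As an illustration of one of the listed methods, the following is a minimal sketch of an adaptive one-sided CUSUM. It is not the authors' implementation: the 28-day sliding baseline, the reference value k, the decision limit h, and the reset-after-alert rule are all assumptions chosen for the example, and the function name is hypothetical.

```python
import statistics

def adaptive_cusum(series, baseline=28, k=0.5, h=4.0):
    """Return the day indices on which a one-sided CUSUM exceeds h.

    Assumptions (not from the source): the baseline mean and standard
    deviation are re-estimated from the preceding `baseline` days, and
    the cumulative sum is reset to zero after each alert.
    """
    alerts = []
    s = 0.0
    for day in range(baseline, len(series)):
        window = series[day - baseline:day]
        mean = statistics.fmean(window)
        sd = statistics.pstdev(window) or 1.0  # guard against a flat baseline
        z = (series[day] - mean) / sd          # standardized excess for the day
        s = max(0.0, s + z - k)                # one-sided: only upward drift accumulates
        if s > h:
            alerts.append(day)
            s = 0.0
    return alerts

# A regular series with one large single-day spike at index 35.
counts = [10, 12, 11, 9] * 10
counts[35] += 20
print(adaptive_cusum(counts))
```

The sliding baseline is what makes the chart "adaptive" in this sketch; a fixed historical baseline would be the non-adaptive alternative.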
Sensitivity testing: We conducted comparative sensitivity testing for categories of time series with similar scales and seasonal behavior. We added multiples of the standard deviation of each time series as single-day injects in separate algorithm runs. For each candidate method, we then used as a sensitivity measure the proportion of these runs for which the output of each algorithm was below alerting thresholds estimated empirically for each algorithm using simulated data streams. We identified the algorithm(s) whose sensitivity was most consistently high for each data category.
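The injection procedure described above can be sketched as follows. This is an illustrative reading only: the detector is a simple z-score stand-in rather than any of the six compared methods, it alerts when a statistic exceeds a fixed threshold instead of using the empirically estimated thresholds the authors describe, and the function names, baseline length, and threshold are hypothetical.

```python
import statistics

def zscore_alert(series, day, baseline=28, threshold=3.0):
    """Hypothetical stand-in detector: alert if the day's count lies more
    than `threshold` standard deviations above the trailing baseline mean."""
    window = series[day - baseline:day]
    mean = statistics.fmean(window)
    sd = statistics.pstdev(window) or 1.0
    return (series[day] - mean) / sd > threshold

def injection_sensitivity(series, multiple, detector, baseline=28):
    """Fraction of separate runs in which a single-day inject is detected.

    Each run adds `multiple` standard deviations of the whole series to
    one day and asks the detector whether that day alerts.
    """
    spike = multiple * statistics.pstdev(series)
    days = range(baseline, len(series))
    hits = 0
    for day in days:
        injected = list(series)   # fresh copy, so runs stay independent
        injected[day] += spike
        if detector(injected, day):
            hits += 1
    return hits / len(days)

counts = [10, 12, 11, 9] * 15
print(injection_sensitivity(counts, 10, zscore_alert))
```

Running the same function over each data category and each candidate detector would give the per-category comparison of sensitivities that the text describes.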
For each syndromic query applied to each data source (outpatient, lab test orders, and prescriptions), 502 authentic time series were derived, one for each reporting treatment facility. Data categories were selected in order to group time series with similar expected algorithm performance: Median > 10; 0 < Median ≤ 10; Median = 0; Lag 7 Autocorrelation Coefficient ≥ 0.2; Lag 7 Autocorrelation Coefficient < 0.2.
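A short sketch of how a time series might be assigned to these categories is given below. It reads the category list as median > 10, 0 < median ≤ 10, and median = 0, combined with a separate split on the lag-7 autocorrelation coefficient at 0.2. The autocorrelation estimator and the function names are assumptions for illustration; the source does not say how the coefficient was computed.

```python
import statistics

def lag_autocorrelation(series, lag=7):
    """Sample autocorrelation at the given lag (standard biased estimator)."""
    mean = statistics.fmean(series)
    denom = sum((x - mean) ** 2 for x in series)
    if denom == 0:
        return 0.0  # a constant series has no measurable weekly pattern
    num = sum((series[t] - mean) * (series[t + lag] - mean)
              for t in range(len(series) - lag))
    return num / denom

def categorize(series):
    """Return (scale category, weekly-pattern category) for a count series.

    Assumes non-negative counts, so a median that is not positive is zero.
    """
    med = statistics.median(series)
    if med > 10:
        scale = "median > 10"
    elif med > 0:
        scale = "0 < median <= 10"
    else:
        scale = "median = 0"
    acf7 = lag_autocorrelation(series, lag=7)
    weekly = "lag-7 ACF >= 0.2" if acf7 >= 0.2 else "lag-7 ACF < 0.2"
    return scale, weekly

# Five busy weekdays followed by a quiet weekend, repeated for eight weeks.
print(categorize([20, 20, 20, 20, 20, 5, 5] * 8))
```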
Timeliness testing: For the timeliness testing, we avoided artificiality of simulated signals by measuring alerting detection delays in the 16 corroborated outbreaks. The multiple time series from these events gave a total of 141 time series with outbreak intervals for timeliness testing.
The following measures were computed to quantify timeliness of detection: Median Detection Delay: median number of days to detect the outbreak. Penalized Mean Detection Delay: mean number of days to detect the outbreak, with outbreak misses penalized as 1 day plus the maximum detection time.
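The two timeliness measures can be written out as a small sketch. Two points are assumptions rather than statements from the source: the median is taken over detected outbreaks only, and "maximum detection time" is interpreted as the largest delay observed among the detected outbreaks. A missed outbreak is represented here by None, and the function names are hypothetical.

```python
import statistics

def median_detection_delay(delays):
    """Median delay in days over detected outbreaks (None marks a miss)."""
    detected = [d for d in delays if d is not None]
    return statistics.median(detected) if detected else None

def penalized_mean_detection_delay(delays):
    """Mean delay in days, with each miss counted as max observed delay + 1."""
    detected = [d for d in delays if d is not None]
    if not detected:
        return None
    penalty = max(detected) + 1
    scored = [d if d is not None else penalty for d in delays]
    return statistics.fmean(scored)

# Three detected outbreaks (0, 2 and 4 days late) and one miss.
delays = [0, 2, 4, None]
print(median_detection_delay(delays))          # 2
print(penalized_mean_detection_delay(delays))  # 2.75
```

The penalized mean separates methods that share a median delay but differ in how many outbreaks they miss, which the median over detected outbreaks alone cannot show.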
Results
Based on the injection results, the Holt-Winters algorithm was most sensitive among time series with positive medians. The adaptive CUSUM and the Shewhart methods were most sensitive for data streams with median zero. Table 1 provides timeliness results using the 141 outbreak-associated streams on sparse (Median=0) and non-sparse data categories.
[Insert table #1 here]
The gray shading in Table 1 indicates methods with shortest detection delays for sparse and non-sparse data streams. The Holt-Winters method was again superior for non-sparse data. For data with median=0, the adaptive CUSUM was superior for a daily false alarm probability of 0.01, but the Shewhart method was timelier for more liberal thresholds.
Conclusions
Both kinds of detection performance analysis showed the method based on Holt-Winters exponential smoothing superior on non-sparse time series with day-of-week effects. The adaptive CUSUM and Shewhart methods proved optimal on sparse data and data without weekly patterns.
PMCID: PMC3692938
biosurveillance; timeliness; detection; alerting methods; sensitivity
20.  Gene Copy-Number Polymorphism Caused by Retrotransposition in Humans 
PLoS Genetics  2013;9(1):e1003242.
The era of whole-genome sequencing has revealed that gene copy-number changes caused by duplication and deletion events have important evolutionary, functional, and phenotypic consequences. Recent studies have therefore focused on revealing the extent of variation in copy-number within natural populations of humans and other species. These studies have found a large number of copy-number variants (CNVs) in humans, many of which have been shown to have clinical or evolutionary importance. For the most part, these studies have failed to detect an important class of gene copy-number polymorphism: gene duplications caused by retrotransposition, which result in a new intron-less copy of the parental gene being inserted into a random location in the genome. Here we describe a computational approach leveraging next-generation sequence data to detect gene copy-number variants caused by retrotransposition (retroCNVs), and we report the first genome-wide analysis of these variants in humans. We find that retroCNVs account for a substantial fraction of gene copy-number differences between any two individuals. Moreover, we show that these variants may often result in expressed chimeric transcripts, underscoring their potential for the evolution of novel gene functions. By locating the insertion sites of these duplicates, we are able to show that retroCNVs have had an important role in recent human adaptation, and we also uncover evidence that positive selection may currently be driving multiple retroCNVs toward fixation. Together these findings imply that retroCNVs are an especially important class of polymorphism, and that future studies of copy-number variation should search for these variants in order to illuminate their potential evolutionary and functional relevance.
Author Summary
Recent studies of human genetic variation have revealed that, in addition to differing at single nucleotide polymorphisms, individuals differ in copy-number at many regions of the genome. These copy-number variants (CNVs) are caused by duplication or deletion events and often affect functional sequences such as genes. Efforts to reveal the functional impact of CNVs have identified many variants increasing the risk of various disorders, and some that are adaptive. However, these studies mostly fail to detect gene duplications caused by retrotransposition, in which an mRNA transcript is reverse-transcribed and reinserted into the genome, yielding a new intron-less gene copy. Here we describe a method leveraging next-generation sequence data to accurately detect gene copy-number variants caused by retrotransposition, or retroCNVs, and apply this method to hundreds of whole-genome sequences from three different human subpopulations. We find that these variants account for a substantial number of gene copy-number differences between individuals, and that gene retrotransposition may often result in both deleterious and beneficial mutations. Indeed, we present evidence that two of these new gene duplications may be adaptive. These results imply that retroCNVs are an especially important class of CNV and should be included in future studies of human copy-number variation.
doi:10.1371/journal.pgen.1003242
PMCID: PMC3554589  PMID: 23359205
21.  A Mathematical Methodology for Determining the Temporal Order of Pathway Alterations Arising during Gliomagenesis 
PLoS Computational Biology  2012;8(1):e1002337.
Human cancer is caused by the accumulation of genetic alterations in cells. Of special importance are changes that occur early during malignant transformation because they may result in oncogene addiction and thus represent promising targets for therapeutic intervention. We have previously described a computational approach, called Retracing the Evolutionary Steps in Cancer (RESIC), to determine the temporal sequence of genetic alterations during tumorigenesis from cross-sectional genomic data of tumors at their fully transformed stage. Since alterations within a set of genes belonging to a particular signaling pathway may have similar or equivalent effects, we applied a pathway-based systems biology approach to the RESIC methodology. This method was used to determine whether alterations of specific pathways develop early or late during malignant transformation. When applied to primary glioblastoma (GBM) copy number data from The Cancer Genome Atlas (TCGA) project, RESIC identified a temporal order of pathway alterations consistent with the order of events in secondary GBMs. We then further subdivided the samples into the four main GBM subtypes and determined the relative contributions of each subtype to the overall results: we found that the overall ordering applied for the proneural subtype but differed for mesenchymal samples. The temporal sequence of events could not be identified for neural and classical subtypes, possibly due to a limited number of samples. Moreover, for samples of the proneural subtype, we detected two distinct temporal sequences of events: (i) RAS pathway activation was followed by TP53 inactivation and finally PI3K2 activation, and (ii) RAS activation preceded only AKT activation. This extension of the RESIC methodology provides an evolutionary mathematical approach to identify the temporal sequence of pathway changes driving tumorigenesis and may be useful in guiding the understanding of signaling rearrangements in cancer development.
Author Summary
Cancer is a deadly disease that develops through the accumulation of genetic changes over time. Many biological models do not incorporate this temporal aspect of tumor formation and progression, in part due to the difficulty of determining the sequence of events through biological experimentation for most cancer types. We previously developed a computational algorithm with which we can quickly and cost-effectively determine the order in which mutations arise in the tumor even when large numbers of mutations are considered. In this paper, we extended our method to incorporate biological knowledge of the common pathways by which cancer progresses. We applied these techniques to primary glioblastoma, the most common form of brain cancer. We found that when all samples are taken into account, a temporal sequence of pathway events emerges; however, different subtypes of glioblastoma vary in their temporal sequence of events. This algorithm can also be easily applied to other cancer types as clinical data becomes available, showing the benefit of computational and mathematical tools in cancer research. Using temporal information, cancer biologists will be able to develop more accurate animal models of tumor formation and learn more about how mutations interact in time, thus leading to better treatments for cancer.
doi:10.1371/journal.pcbi.1002337
PMCID: PMC3252265  PMID: 22241976
22.  HaplotypeCN: Copy Number Haplotype Inference with Hidden Markov Model and Localized Haplotype Clustering 
PLoS ONE  2014;9(5):e96841.
Copy number variation (CNV) has been reported to be associated with disease and various cancers. Hence, identifying the accurate position and the type of CNV is currently a critical issue. There are many tools for detecting CNV regions, constructing haplotype phases on CNV regions, or estimating the numerical copy numbers. However, none of them can do all three tasks at the same time. This paper presents a method based on a hidden Markov model to detect parent-specific copy number change on both chromosomes with signals from SNP arrays. A haplotype tree is constructed with dynamic branch merging to model the transition of the copy number status of the two alleles assessed at each SNP locus. The emission models are constructed for the genotypes formed with the two haplotypes. The proposed method can provide the segmentation points of the CNV regions as well as the haplotype phasing for the allelic status on each chromosome. The estimated copy numbers are provided as fractional numbers, which can accommodate the somatic mutation in cancer specimens that usually consist of heterogeneous cell populations. The algorithm is evaluated on simulated data and the previously published regions of CNV of the 270 HapMap individuals. The results were compared with five popular methods: PennCNV, genoCN, COKGEN, QuantiSNP and cnvHap. The application on oral cancer samples demonstrates how the proposed method can facilitate clinical association studies. The proposed algorithm exhibits comparable sensitivity of the CNV regions to the best algorithm in our genome-wide study and demonstrates the highest detection rate in SNP-dense regions. In addition, we provide better haplotype phasing accuracy than similar approaches. The clinical association carried out with our fractional estimate of copy numbers in the cancer samples provides better detection power than that with integer copy number states.
doi:10.1371/journal.pone.0096841
PMCID: PMC4029584  PMID: 24849202
23.  Near-Native Protein Loop Sampling Using Nonparametric Density Estimation Accommodating Sparcity 
PLoS Computational Biology  2011;7(10):e1002234.
Unlike the core structural elements of a protein like regular secondary structure, template based modeling (TBM) has difficulty with loop regions due to their variability in sequence and structure as well as the sparse sampling from a limited number of homologous templates. We present a novel, knowledge-based method for loop sampling that leverages homologous torsion angle information to estimate a continuous joint backbone dihedral angle density at each loop position. The φ,ψ distributions are estimated via a Dirichlet process mixture of hidden Markov models (DPM-HMM). Models are quickly generated based on samples from these distributions and were enriched using an end-to-end distance filter. The performance of the DPM-HMM method was evaluated against a diverse test set in a leave-one-out approach. Candidates as low as 0.45 Å RMSD and with a worst case of 3.66 Å were produced. For the canonical loops like the immunoglobulin complementarity-determining regions (mean RMSD <2.0 Å), the DPM-HMM method performs as well as or better than the best templates, demonstrating that our automated method recaptures these canonical loops without inclusion of any IgG-specific terms or manual intervention. In cases with poor or few good templates (mean RMSD >7.0 Å), this sampling method produces a population of loop structures to around 3.66 Å for loops up to 17 residues. In a direct comparison of sampling against the Loopy algorithm, our method demonstrates the ability to sample nearer native structures for both the canonical CDRH1 and non-canonical CDRH3 loops. Lastly, in the realistic test conditions of the CASP9 experiment, successful application of DPM-HMM for 90 loops from 45 TBM targets shows the general applicability of our sampling method to the loop modeling problem. These results demonstrate that our DPM-HMM produces an advantage by consistently sampling near native loop structure.
The software used in this analysis is available for download at http://www.stat.tamu.edu/~dahl/software/cortorgles/.
Author Summary
A protein's structure consists of elements of regular secondary structure connected by less regular stretches of loop segments. The irregularity of the loop structure makes loop modeling quite challenging. More accurate sampling of these loop conformations has a direct impact on protein modeling, design, function classification, as well as protein interactions. A method has been developed that extends a more comprehensive knowledge-based approach to producing models of the loop regions of protein structure. Most physical models cannot adequately sample the large conformational space, while the more discrete knowledge based libraries are conformationally limited. To address both of these problems, we introduce a novel statistical method that produces a continuous yet weighted estimation of loop conformational space from a discrete library of structures by using a Dirichlet process mixture of hidden Markov models (DPM-HMM). Applied to loop structure sampling, the results of a number of tests demonstrate that our approach quickly generates large numbers of candidates with near native loop conformations. Most significantly, in the cases where the template sampling is sparse and/or far from native conformations, the DPM-HMM method samples close to the native space and produces a population of accurate loop structures.
doi:10.1371/journal.pcbi.1002234
PMCID: PMC3197639  PMID: 22028638
24.  Evolutionary Triplet Models of Structured RNA 
PLoS Computational Biology  2009;5(8):e1000483.
The reconstruction and synthesis of ancestral RNAs is a feasible goal for paleogenetics. This will require new bioinformatics methods, including a robust statistical framework for reconstructing histories of substitutions, indels and structural changes. We describe a “transducer composition” algorithm for extending pairwise probabilistic models of RNA structural evolution to models of multiple sequences related by a phylogenetic tree. This algorithm draws on formal models of computational linguistics as well as the 1985 protosequence algorithm of David Sankoff. The output of the composition algorithm is a multiple-sequence stochastic context-free grammar. We describe dynamic programming algorithms, which are robust to null cycles and empty bifurcations, for parsing this grammar. Example applications include structural alignment of non-coding RNAs, propagation of structural information from an experimentally-characterized sequence to its homologs, and inference of the ancestral structure of a set of diverged RNAs. We implemented the above algorithms for a simple model of pairwise RNA structural evolution; in particular, the algorithms for maximum likelihood (ML) alignment of three known RNA structures and a known phylogeny and inference of the common ancestral structure. We compared this ML algorithm to a variety of related, but simpler, techniques, including ML alignment algorithms for simpler models that omitted various aspects of the full model and also a posterior-decoding alignment algorithm for one of the simpler models. In our tests, incorporation of basepair structure was the most important factor for accurate alignment inference; appropriate use of posterior-decoding was next; and fine details of the model were least important. Posterior-decoding heuristics can be substantially faster than exact phylogenetic inference, so this motivates the use of sum-over-pairs heuristics where possible (and approximate sum-over-pairs). 
For more exact probabilistic inference, we discuss the use of transducer composition for ML (or MCMC) inference on phylogenies, including possible ways to make the core operations tractable.
Author Summary
A number of leading methods for bioinformatics analysis of structural RNAs use probabilistic grammars as models for pairs of homologous RNAs. We show that any such pairwise grammar can be extended to an entire phylogeny by treating the pairwise grammar as a machine (a “transducer”) that models a single ancestor-descendant relationship in the tree, transforming one RNA structure into another. In addition to phylogenetic enhancement of current applications, such as RNA genefinding, homology detection, alignment and secondary structure prediction, this should enable probabilistic phylogenetic reconstruction of RNA sequences that are ancestral to present-day genes. We describe statistical inference algorithms, software implementations, and a simulation-based comparison of three-taxon maximum likelihood alignment to several other methods for aligning three sibling RNAs. In the Discussion we consider how the three-taxon RNA alignment-reconstruction-folding algorithm, which is currently very computationally-expensive, might be made more efficient so that larger phylogenies could be considered.
doi:10.1371/journal.pcbi.1000483
PMCID: PMC2725318  PMID: 19714212
25.  Software comparison for evaluating genomic copy number variation for Affymetrix 6.0 SNP array platform 
BMC Bioinformatics  2011;12:220.
Background
Copy number data are routinely being extracted from genome-wide association study chips using a variety of software. We empirically evaluated and compared four freely-available software packages designed for Affymetrix SNP chips to estimate copy number: Affymetrix Power Tools (APT), Aroma.Affymetrix, PennCNV and CRLMM. Our evaluation used 1,418 GENOA samples that were genotyped on the Affymetrix Genome-Wide Human SNP Array 6.0. We compared bias and variance in the locus-level copy number data, the concordance amongst regions of copy number gains/deletions and the false-positive rate amongst deleted segments.
Results
APT had median locus-level copy numbers closest to a value of two, whereas PennCNV and Aroma.Affymetrix had the smallest variability associated with the median copy number. Of those evaluated, only PennCNV provides copy number specific quality-control metrics and identified 136 poor CNV samples. Regions of copy number variation (CNV) were detected using the hidden Markov models provided within PennCNV and CRLMM/VanillaIce. PennCNV detected more CNVs than CRLMM/VanillaIce; the median number of CNVs detected per sample was 39 and 30, respectively. PennCNV detected most of the regions that CRLMM/VanillaIce did as well as additional CNV regions. The median concordance between PennCNV and CRLMM/VanillaIce was 47.9% for duplications and 51.5% for deletions. The estimated false-positive rate associated with deletions was similar for PennCNV and CRLMM/VanillaIce.
Conclusions
If the objective is to perform statistical tests on the locus-level copy number data, our empirical results suggest that PennCNV or Aroma.Affymetrix is optimal. If the objective is to perform statistical tests on the summarized segmented data then PennCNV would be preferred over CRLMM/VanillaIce. Specifically, PennCNV allows the analyst to estimate locus-level copy number, perform segmentation and evaluate CNV-specific quality-control metrics within a single software package. PennCNV has relatively small bias, small variability and detects more regions while maintaining a similar estimated false-positive rate as CRLMM/VanillaIce. More generally, we advocate that software developers need to provide guidance with respect to evaluating and choosing optimal settings in order to obtain optimal results for an individual dataset. Until such guidance exists, we recommend trying multiple algorithms, evaluating concordance/discordance and subsequently consider the union of regions for downstream association tests.
doi:10.1186/1471-2105-12-220
PMCID: PMC3146450  PMID: 21627824
