Results 1-9 (9)
1.  APOBEC3B upregulation and genomic mutation patterns in serous ovarian carcinoma 
Cancer research  2013;73(24):10.1158/0008-5472.CAN-13-1753.
Ovarian cancer is a clinically and molecularly heterogeneous disease. The driving forces behind this variability are unknown. Here we report wide variation in expression of the DNA cytosine deaminase APOBEC3B, with elevated expression in a majority of ovarian cancer cell lines (3 standard deviations above the mean of normal ovarian surface epithelial cells) and high grade primary ovarian cancers. APOBEC3B is active in the nucleus of several ovarian cancer cell lines and elicits a biochemical preference for deamination of cytosines in 5′TC dinucleotides. Importantly, examination of whole-genome sequence from 16 ovarian cancers reveals that APOBEC3B expression correlates with total mutation load as well as elevated levels of transversion mutations. In particular, high APOBEC3B expression correlates with C-to-A and C-to-G transversion mutations within 5′TC dinucleotide motifs in early-stage high grade serous ovarian cancer genomes, suggesting that APOBEC3B-catalyzed genomic uracil lesions are further processed by downstream DNA ‘repair’ enzymes including error-prone translesion polymerases. These data identify a potential role for APOBEC3B in serous ovarian cancer genomic instability.
PMCID: PMC3867573  PMID: 24154874
APOBEC3B; DNA cytosine deamination; genomic uracil; ovarian cancer; transversion mutations
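The motif analysis described in this abstract (C-to-A and C-to-G transversions within 5′TC dinucleotides) can be illustrated with a minimal sketch. The input layout, a tuple of (5′ base, reference, alternate, 3′ base), and both function names are assumptions made for illustration and are not taken from the paper. Mutations reported at a G are folded onto the opposite strand so that every event is read at the cytosine.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def to_pyrimidine_strand(five_prime, ref, alt, three_prime):
    """Express a substitution on the strand where the reference base is C or T.

    For a purine reference, the event is reverse-complemented: the 3' neighbour
    on the reported strand becomes the 5' neighbour on the opposite strand.
    """
    if ref in ("C", "T"):
        return five_prime, ref, alt
    return COMPLEMENT[three_prime], COMPLEMENT[ref], COMPLEMENT[alt]


def count_tc_transversions(mutations):
    """Count C>A and C>G substitutions whose 5' neighbour is T.

    `mutations` is a hypothetical iterable of
    (five_prime_base, ref, alt, three_prime_base) tuples.
    """
    counts = Counter()
    for five_prime, ref, alt, three_prime in mutations:
        five_prime, ref, alt = to_pyrimidine_strand(five_prime, ref, alt, three_prime)
        if five_prime == "T" and ref == "C" and alt in ("A", "G"):
            counts["C>" + alt] += 1
    return counts


example = [
    ("T", "C", "A", "G"),  # C>A in a 5'TC context
    ("A", "C", "G", "T"),  # C>G, but the 5' base is A, so not counted
    ("C", "G", "C", "A"),  # G>C with 3' A, i.e. C>G in 5'TC on the other strand
    ("T", "C", "T", "A"),  # C>T transition, not a transversion
]
tc_counts = count_tc_transversions(example)
```

Correlating such counts with expression levels across tumors would be a separate step and is not shown here.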
2.  Genetic Alterations Associated With Progression From Pancreatic Intraepithelial Neoplasia to Invasive Pancreatic Tumor 
Gastroenterology  2013;145(5):1098-1109.e1.
Background & Aims
Increasing grade of pancreatic intraepithelial neoplasia (PanIN) has been associated with progression to pancreatic ductal adenocarcinoma (PDAC). However, the mechanisms that control progression from PanINs to PDAC are not well understood. We investigated the genetic alterations involved in this process.
Methods
Genomic DNA samples from laser-capture microdissected PDACs and adjacent PanIN2 and PanIN3 lesions from 10 patients with pancreatic cancer were analyzed by exome sequencing.
Results
Similar numbers of somatic mutations were identified in PanINs and tumors, but the mutational load varied greatly among cases. Ten of the 15 isolated PanINs shared more than 50% of somatic mutations with associated tumors. Mutations common to tumors and clonally related PanIN2 and PanIN3 lesions were identified as genes that could promote carcinogenesis. KRAS and TP53 were frequently altered in PanINs and tumors, but few other recurrently modified genes were detected. Mutations in DNA damage response genes were prevalent in all samples. Genes that encode proteins involved in gap junctions, the actin cytoskeleton, the mitogen-activated protein kinase signaling pathway, axon guidance, and cell cycle regulation were among the earliest targets of mutagenesis in PanINs that progressed to PDAC.
Conclusions
Early-stage PanIN2 lesions appear to contain many of the somatic gene alterations required for PDAC development.
PMCID: PMC3926442  PMID: 23912084
pancreas; tumorigenesis; LCM; whole genome amplification
3.  Calculating Sample Size Estimates for RNA Sequencing Data 
Journal of Computational Biology  2013;20(12):970-978.
Given the high technical reproducibility and orders of magnitude greater resolution than microarrays, next-generation sequencing of mRNA (RNA-Seq) is quickly becoming the de facto standard for measuring levels of gene expression in biological experiments. Two important questions must be taken into consideration when designing a particular experiment, namely, 1) how deep does one need to sequence? and, 2) how many biological replicates are necessary to observe a significant change in expression?
Based on the gene expression distributions from 127 RNA-Seq experiments, we find evidence that 91% ± 4% of all annotated genes are sequenced at a frequency of 0.1 times per million bases mapped, regardless of sample source. Based on this observation, and combining this information with other parameters such as biological variation and technical variation that we empirically estimate from our large datasets, we developed a model to estimate the statistical power needed to identify differentially expressed genes from RNA-Seq experiments.
Our results provide a needed reference for ensuring RNA-Seq gene expression studies are conducted with the optimal sample size, power, and sequencing depth. We also make available both R code and an Excel worksheet for investigators to calculate these values for their own experiments.
PMCID: PMC3842884  PMID: 23961961
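To make the sample-size question in this abstract concrete, the sketch below uses a common normal-approximation formula for a two-group comparison, n = 2 (z_{1-α/2} + z_{power})² (1/depth + cv²) / (ln effect)². It is offered as an illustration under stated assumptions and is not claimed to reproduce the authors' R code or Excel worksheet. The function name and the parameter names (`depth` as average read count per gene, `cv` as the biological coefficient of variation, `effect` as the fold change) are hypothetical.

```python
import math
from statistics import NormalDist


def samples_per_group(depth, cv, effect, alpha=0.05, power=0.9):
    """Approximate number of biological replicates per group.

    depth  : assumed average read count for the gene of interest
    cv     : assumed biological coefficient of variation
    effect : fold change to detect (e.g. 2.0 for a doubling)
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_power = NormalDist().inv_cdf(power)
    # 1/depth approximates counting (technical) variance on the log scale;
    # cv**2 approximates biological variance.
    variance = 1.0 / depth + cv ** 2
    return 2 * (z_alpha + z_power) ** 2 * variance / math.log(effect) ** 2


n_estimate = samples_per_group(depth=20, cv=0.4, effect=2.0)
n_rounded = math.ceil(n_estimate)  # replicates must be whole samples
```

Because 1/depth shrinks quickly as coverage grows, the cv² term dominates at moderate depth, which is why adding replicates tends to help more than sequencing deeper under this approximation.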
4.  PatternCNV: a versatile tool for detecting copy number changes from exome sequencing data 
Bioinformatics  2014;30(18):2678-2680.
Motivation: Exome sequencing (exome-seq) data, which are typically used for calling exonic mutations, have also been utilized in detecting DNA copy number variations (CNVs). Despite the existence of several CNV detection tools, there is still a great need for a sensitive and accurate CNV-calling algorithm that has built-in QC steps and does not require a paired reference for each sample.
Results: We developed a novel method named PatternCNV, which (i) accounts for the read coverage variations between exons while leveraging the consistencies of this variability across different samples; (ii) reduces alignment BAM files to WIG format and therefore greatly accelerates computation; (iii) incorporates multiple QC measures designed to identify outlier samples and batch effects; and (iv) provides a variety of visualization options including chromosome, gene and exon-level views of CNVs, along with a tabular summarization of the exon-level CNVs. Compared with other CNV-calling algorithms using data from a lymphoma exome-seq study, PatternCNV has higher sensitivity and specificity.
Availability and implementation: The software for PatternCNV is implemented using Perl and R, and can be used in Mac or Linux environments. Software and user manual are available at, and R package at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC4155258  PMID: 24876377
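The idea of comparing each exon's coverage against a pattern shared across samples, rather than against a paired reference, can be sketched generically as follows. This is not PatternCNV's algorithm; it is a simplified illustration that assumes coverage values are already normalized for library size, and the function name, dictionary layout, and pseudocount are all hypothetical.

```python
import math
from statistics import median


def exon_log2_ratios(sample, cohort, pseudocount=1.0):
    """Log2 ratio of a sample's exon coverage to the cohort median per exon.

    sample : dict mapping exon id -> normalized coverage for one sample
    cohort : list of such dicts, used as an unpaired reference pattern
    """
    ratios = {}
    for exon, coverage in sample.items():
        reference = median(other[exon] for other in cohort)
        # The pseudocount avoids division by zero for uncovered exons.
        ratios[exon] = math.log2((coverage + pseudocount) / (reference + pseudocount))
    return ratios


cohort = [{"exon1": 99, "exon2": 49}] * 3
sample = {"exon1": 99, "exon2": 24}
ratios = exon_log2_ratios(sample, cohort)
```

A ratio near 0 suggests no change relative to the cohort, while a ratio near -1 is what a single-copy loss would produce in this simplified setting.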
5.  The Biological Reference Repository (BioR): a rapid and flexible system for genomics annotation 
Bioinformatics  2014;30(13):1920-1922.
Motivation: The Biological Reference Repository (BioR) is a toolkit for annotating variants. BioR stores public and user-specific annotation sources in indexed JSON-encoded flat files (catalogs). The BioR toolkit provides the functionality to combine and retrieve annotation from these catalogs via the command-line interface. Several catalogs from commonly used annotation sources and instructions for creating user-specific catalogs are provided. Commands from the toolkit can be combined with other UNIX commands for advanced annotation processing. We also provide instructions for the development of custom annotation pipelines.
Availability and implementation: The package is implemented in Java and makes use of external tools written in Java and Perl. The toolkit can be executed on Mac OS X 10.5 and above or any Linux distribution. The BioR application, quickstart, and user guide documents and many biological examples are available at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC4071205  PMID: 24618464
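As an illustration of what a JSON-encoded flat-file record might look like to downstream code, the sketch below parses one tab-delimited line whose last column holds a JSON object. The column layout, field names, and function name are assumptions for illustration only; they do not represent the BioR command-line tools or the actual catalog schema.

```python
import json


def parse_catalog_line(line):
    """Split a tab-delimited line into its leading columns and a JSON payload.

    Assumes (hypothetically) that the final column is a JSON object.
    """
    *columns, payload = line.rstrip("\n").split("\t")
    return columns, json.loads(payload)


# Hypothetical record: chromosome, start, end, then a JSON annotation.
record = '17\t41196312\t41277500\t{"gene": "BRCA1", "strand": "-"}\n'
columns, annotation = parse_catalog_line(record)
```

Keeping the annotation as JSON in a single column means such a line can still pass through ordinary UNIX text tools, which fits the abstract's point about combining toolkit commands with other UNIX commands.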
6.  From Days to Hours: Reporting Clinically Actionable Variants from Whole Genome Sequencing 
PLoS ONE  2014;9(2):e86803.
As the cost of whole genome sequencing (WGS) decreases, clinical laboratories will be looking at broadly adopting this technology to screen for variants of clinical significance. To fully leverage this technology in a clinical setting, results need to be reported quickly, as the turnaround time could potentially impact patient care. The latest sequencers can sequence a whole human genome in about 24 hours. However, depending on the computing infrastructure available, the processing of data can take several days, with the majority of computing time devoted to aligning reads to genomic regions that are to date not clinically interpretable. In an attempt to accelerate the reporting of clinically actionable variants, we have investigated the utility of a multi-step alignment algorithm focused on aligning reads and calling variants in genomic regions of clinical relevance prior to processing the remaining reads on the whole genome. This iterative workflow significantly accelerates the reporting of clinically actionable variants with no loss of accuracy when compared to genotypes obtained with the OMNI SNP platform or to variants detected with a standard workflow that combines Novoalign and GATK.
PMCID: PMC3914798  PMID: 24505267
7.  SoftSearch: Integration of Multiple Sequence Features to Identify Breakpoints of Structural Variations 
PLoS ONE  2013;8(12):e83356.
Structural variation (SV) represents a significant, yet poorly understood contribution to an individual’s genetic makeup. Advanced next-generation sequencing technologies are widely used to discover such variations, but there is no single detection tool that is considered a community standard. In an attempt to fulfil this need, we developed an algorithm, SoftSearch, for discovering structural variant breakpoints in Illumina paired-end next-generation sequencing data. SoftSearch combines multiple strategies for detecting SV including split-read, discordant read-pair, and unmated pairs. Co-localized split-reads and discordant read pairs are used to refine the breakpoints.
We developed and validated SoftSearch using real and synthetic datasets. SoftSearch’s key features are 1) not requiring secondary (or exhaustive primary) alignment, 2) portability into established sequencing workflows, and 3) applicability to any DNA-sequencing experiment (e.g. whole genome, exome, custom capture, etc.). SoftSearch identifies breakpoints from a small number of soft-clipped bases from split reads and a few discordant read-pairs which on their own would not be sufficient to make an SV call.
We show that SoftSearch can identify more true SVs by combining multiple sequence features. SoftSearch was able to call clinically relevant SVs in the BRCA2 gene not reported by other tools while offering significantly improved overall performance.
PMCID: PMC3865185  PMID: 24358278
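Soft-clipped bases, which this abstract names as one of the sequence features used to locate breakpoints, are recorded in an alignment's CIGAR string with the `S` operation. The sketch below shows one way to read the soft-clip lengths at each end of a read. It is a generic illustration and not SoftSearch's implementation; the function name is hypothetical.

```python
import re

# One CIGAR element: a length followed by an operation code from the SAM format.
CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clip_lengths(cigar):
    """Return (left, right) soft-clip lengths for a CIGAR string.

    Hard clips (H) may sit outside soft clips, so they are skipped first.
    An unavailable CIGAR ("*") yields (0, 0).
    """
    ops = [(int(n), op) for n, op in CIGAR_ELEMENT.findall(cigar) if op != "H"]
    left = ops[0][0] if ops and ops[0][1] == "S" else 0
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return left, right


clips = [soft_clip_lengths(c) for c in ("5S90M5S", "100M", "3H10S87M", "*")]
```

Reads whose clipped ends pile up at the same reference position are the kind of co-localized evidence the abstract describes combining with discordant read pairs.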
8.  Deep Sequence Analysis of Non-Small Cell Lung Cancer: Integrated Analysis of Gene Expression, Alternative Splicing, and Single Nucleotide Variations in Lung Adenocarcinomas with and without Oncogenic KRAS Mutations 
KRAS mutations are highly prevalent in non-small cell lung cancer (NSCLC), and tumors harboring these mutations tend to be aggressive and resistant to chemotherapy. We used next-generation sequencing technology to identify pathways that are specifically altered in lung tumors harboring a KRAS mutation. Paired-end RNA-sequencing of 15 primary lung adenocarcinoma tumors (8 harboring mutant KRAS and 7 with wild-type KRAS) was performed. Sequences were mapped to the human genome, and genomic features, including differentially expressed genes, alternate splicing isoforms and single nucleotide variants, were determined for tumors with and without KRAS mutation using a variety of computational methods. Network analysis was carried out on genes showing differential expression (374 genes), alternate splicing (259 genes), and SNV-related changes (65 genes) in NSCLC tumors harboring a KRAS mutation. Genes exhibiting two or more connections from the lung adenocarcinoma network were used to carry out integrated pathway analysis. The most significant signaling pathways identified through this analysis were the NFκB, ERK1/2, and AKT pathways. A 27 gene mutant KRAS-specific sub network was extracted based on gene–gene connections from the integrated network, and interrogated for druggable targets. Our results confirm previous evidence that mutant KRAS tumors exhibit activated NFκB, ERK1/2, and AKT pathways and may be preferentially sensitive to therapeutics targeted toward these pathways. In addition, our analysis indicates novel, previously unappreciated links between mutant KRAS and the TNFR and PPARγ signaling pathways, suggesting that targeted PPARγ antagonists and TNFR inhibitors may be useful therapeutic strategies for treatment of mutant KRAS lung tumors. Our study is the first to integrate genomic features from RNA-Seq data from NSCLC and to define a first draft genomic landscape model that is unique to tumors with oncogenic KRAS mutations.
PMCID: PMC3356053  PMID: 22655260
transcriptome sequencing; RNA-Seq; KRAS mutation; NSCLC; bioinformatics; network analysis; data integration and computational methods
9.  Microarray analysis of the in vivo sequence preferences of a minor groove binding drug 
BMC Genomics  2008;9:32.
Minor groove binding drugs (MGBDs) interact with DNA in a sequence-specific manner and can cause changes in gene expression at the level of transcription. They serve as valuable models for protein interactions with DNA and form an important class of antitumor, antiviral, antitrypanosomal and antibacterial drugs. There is a need to extend knowledge of the sequence requirements for MGBDs from in vitro DNA binding studies to living cells.
Here we describe the use of microarray analysis to discover yeast genes that are affected by treatment with the MGBD berenil, thereby allowing the investigation of its sequence requirements for binding in vivo. A novel approach to sequence analysis allowed us to address hypotheses about genes that were directly or indirectly affected by drug binding. The results show that the sequence features of A/T richness and heteropolymeric character discovered by in vitro berenil binding studies are found upstream of genes hypothesized to be directly affected by berenil but not upstream of those hypothesized to be indirectly affected or those shown to be unaffected.
The data support the conclusion that effects of berenil on gene expression in yeast cells can be explained by sequence patterns discovered by in vitro binding experiments. The results shed light on the sequence and structural rules by which berenil binds to DNA and affects the transcriptional regulation of genes and contribute generally to the development of MGBDs as tools for basic and applied research.
PMCID: PMC2254601  PMID: 18215295