Results 1-14 (14)

1.  Calculating Sample Size Estimates for RNA Sequencing Data 
Journal of Computational Biology  2013;20(12):970-978.
Given the high technical reproducibility and orders of magnitude greater resolution than microarrays, next-generation sequencing of mRNA (RNA-Seq) is quickly becoming the de facto standard for measuring levels of gene expression in biological experiments. Two important questions must be taken into consideration when designing a particular experiment, namely, 1) how deep does one need to sequence? and, 2) how many biological replicates are necessary to observe a significant change in expression?
Based on the gene expression distributions from 127 RNA-Seq experiments, we find evidence that 91% ± 4% of all annotated genes are sequenced at a frequency of 0.1 times per million bases mapped, regardless of sample source. Based on this observation, and combining this information with other parameters such as biological variation and technical variation that we empirically estimate from our large datasets, we developed a model to estimate the statistical power needed to identify differentially expressed genes from RNA-Seq experiments.
Our results provide a needed reference for ensuring RNA-Seq gene expression studies are conducted with the optimal sample size, power, and sequencing depth. We also make available both R code and an Excel worksheet for investigators to calculate these values for their own experiments.
PMCID: PMC3842884  PMID: 23961961
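The abstract above does not state the power model itself. As a rough illustration of how sequencing depth, biological variation, and effect size trade off, the sketch below assumes a common normal approximation for count data in which the per-gene variance on the log scale is taken as 1/depth (technical) plus CV squared (biological). The function name, parameter names, and example values are hypothetical and are not taken from the authors' R code or Excel worksheet.

```python
from math import ceil, log
from statistics import NormalDist


def samples_per_group(depth, cv, effect, alpha=0.05, power=0.9):
    """Approximate biological replicates per group for a two-group comparison.

    depth  -- average read count for the gene of interest (assumed input)
    cv     -- biological coefficient of variation
    effect -- fold change to detect (e.g. 2.0 for a doubling)
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)
    # Assumption: 1/depth approximates counting variance; cv**2 is biological variance.
    variance = 1.0 / depth + cv ** 2
    return 2 * (z_alpha + z_beta) ** 2 * variance / log(effect) ** 2


# Illustrative values only: 20 reads per gene, CV of 0.4, two-fold change.
n = samples_per_group(depth=20, cv=0.4, effect=2.0)
print(ceil(n))
```

Under this approximation, increasing depth lowers only the 1/depth term, so once biological variation dominates, additional replicates help more than additional reads.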
2.  Exome sequencing reveals frequent deleterious germline variants in cancer susceptibility genes in women with invasive breast cancer undergoing neoadjuvant chemotherapy 
When sequencing blood and tumor samples to identify targetable somatic variants for cancer therapy, clinically relevant germline variants may be uncovered. We evaluated the prevalence of deleterious germline variants in cancer susceptibility genes in women with breast cancer referred for neoadjuvant chemotherapy and returned clinically actionable results to patients. Exome sequencing was performed on blood samples from women with invasive breast cancer referred for neoadjuvant chemotherapy. Germline variants within 142 hereditary cancer susceptibility genes were filtered and reviewed for pathogenicity. Return of results was offered to patients with deleterious variants in actionable genes if they were not aware of their result through clinical testing. 124 patients were enrolled (median age 51) with the following subtypes: triple negative (n = 43, 34.7 %), HER2+ (n = 37, 29.8 %), luminal B (n = 31, 25 %), and luminal A (n = 13, 10.5 %). Twenty-eight deleterious variants were identified in 26/124 (21.0 %) patients in the following genes: ATM (n = 3), BLM (n = 1), BRCA1 (n = 4), BRCA2 (n = 8), CHEK2 (n = 2), FANCA (n = 1), FANCI (n = 1), FANCL (n = 1), FANCM (n = 1), FH (n = 1), MLH3 (n = 1), MUTYH (n = 2), PALB2 (n = 1), and WRN (n = 1). 121/124 (97.6 %) patients consented to return of research results. Thirteen (10.5 %) had actionable variants, including four that were returned to patients and led to changes in medical management. Deleterious variants in cancer susceptibility genes are highly prevalent in patients with invasive breast cancer referred for neoadjuvant chemotherapy undergoing exome sequencing. Detection of these variants impacts medical management.
Electronic supplementary material
The online version of this article (doi:10.1007/s10549-015-3545-6) contains supplementary material, which is available to authorized users.
PMCID: PMC4559569  PMID: 26296701
Breast cancer; Neoadjuvant chemotherapy; High-risk breast cancer; Return of results; Exome sequencing; Germline mutation/pathogenic germline variant
3.  Transcriptomic and Immunohistochemical Profiling of SLC6A14 in Pancreatic Ductal Adenocarcinoma 
BioMed Research International  2015;2015:593572.
We used a target-centric strategy to identify transporter proteins upregulated in pancreatic ductal adenocarcinoma (PDAC) as potential targets for a functional imaging probe to complement existing anatomical imaging approaches. We performed transcriptomic profiling (microarray and RNASeq) on histologically confirmed primary PDAC tumors and normal pancreas tissue from 33 patients, including five patients whose tumors were not visible on computed tomography. Target expression was confirmed with immunohistochemistry on tissue microarrays from 94 PDAC patients. The best imaging target identified was SLC6A14 (a neutral and basic amino acid transporter). SLC6A14 was overexpressed at the transcriptional level in all patients and expressed at the protein level in 95% of PDAC tumors. Very little is known about the role of SLC6A14 in PDAC and our results demonstrate that this target merits further investigation as a candidate transporter for functional imaging of PDAC.
PMCID: PMC4461733  PMID: 26106611
4.  PANDA: pathway and annotation explorer for visualizing and interpreting gene-centric data 
PeerJ  2015;3:e970.
Objective. Bringing together genomics, transcriptomics, proteomics, and other -omics technologies is an important step towards developing highly personalized medicine. However, instrumentation has advanced far beyond expectations, and we are now able to generate data faster than it can be interpreted.
Materials and Methods. We have developed PANDA (Pathway AND Annotation) Explorer, a visualization tool that integrates gene-level annotation in the context of biological pathways to help interpret complex data from disparate sources. PANDA is a web-based application that displays data in the context of well-studied pathways like KEGG, BioCarta, and PharmGKB. PANDA represents data/annotations as icons in the graph while maintaining the other data elements (i.e., other columns for the table of annotations). Custom pathways from underrepresented diseases can be imported when existing data sources are inadequate. PANDA also allows sharing annotations among collaborators.
Results. In our first use case, we show how easy it is to view supplemental data from a manuscript in the context of a user’s own data. Another use-case is provided describing how PANDA was leveraged to design a treatment strategy from the somatic variants found in the tumor of a patient with metastatic sarcomatoid renal cell carcinoma.
Conclusion. PANDA facilitates the interpretation of gene-centric annotations by visually integrating this information with the context of biological pathways. The application can be downloaded or used directly from our website:
PMCID: PMC4451017  PMID: 26038725
Pathway; Visualization; Genomics; User interface; Data integration; Variant interpretation; Annotation and pathway visualization
5.  TP53 mutations, tetraploidy and homologous recombination repair defects in early stage high-grade serous ovarian cancer 
Nucleic Acids Research  2015;43(14):6945-6958.
To determine early somatic changes in high-grade serous ovarian cancer (HGSOC), we performed whole genome sequencing on a rare collection of 16 low stage HGSOCs. The majority showed extensive structural alterations (one had an ultramutated profile), exhibited high levels of p53 immunoreactivity, and harboured a TP53 mutation, deletion or inactivation. BRCA1 and BRCA2 mutations were observed in two tumors, with nine showing evidence of a homologous recombination (HR) defect. Combined analysis with The Cancer Genome Atlas (TCGA) indicated that low and late stage HGSOCs have similar mutation and copy number profiles. We also found evidence that deleterious TP53 mutations are the earliest events, followed by deletions or loss of heterozygosity (LOH) of chromosomes carrying TP53, BRCA1 or BRCA2. Inactivation of HR appears to be an early event, as 62.5% of tumours showed a LOH pattern suggestive of HR defects. Three tumours with the highest ploidy had little genome-wide LOH, yet one of these had a homozygous somatic frame-shift BRCA2 mutation, suggesting that some carcinomas begin as tetraploid then descend into diploidy accompanied by genome-wide LOH. Lastly, we found evidence that structural variants (SV) cluster in HGSOC, but are absent in one ultramutated tumor, providing insights into the pathogenesis of low stage HGSOC.
PMCID: PMC4538798  PMID: 25916844
6.  Somatic expression of ENRAGE is associated with obesity status among patients with clear cell renal cell carcinoma 
Carcinogenesis  2013;35(4):822-827.
An association between obesity and development of ccRCC has been established; however, the molecular mechanisms are unknown. We used a multistage design to identify and validate that overexpression of ENRAGE is an obesity-associated somatic alteration.
An association between obesity and development of clear cell renal cell carcinoma (ccRCC) has been established in the literature; however, there are limited data regarding the molecular mechanisms that underlie this association. Therefore, we used a multistage design to identify and validate genes that are associated with obesity-related ccRCC. We conducted a microarray study and compared gene expression between obese and non-obese subjects in ccRCC tumors and patient-matched normal kidney tissues. Analyses were stratified by smoking status and subsequently performed on the combined cohort. The primary objective was to identify genes where the fold change of ccRCC tumor expression between obese and non-obese subjects was different than the fold change in the patient-matched normal kidney tissue. Thus, we utilized a mixed model and evaluated the tissue type-by-obesity status interaction term. Targeted validation was performed using reverse transcription–polymerase chain reaction (RT–PCR) on an independent cohort. ENRAGE was identified in the microarray study and subsequently validated using RT–PCR to have a statistically significant tissue type-by-obesity status interaction. Specifically, although ENRAGE is similarly expressed across obese and non-obese subjects in normal tissue, it is upregulated in the patient-matched ccRCC tumor tissue. Additionally, ENRAGE is upregulated in tumors that are wild-type for the von Hippel Lindau gene and in tumors for subjects with poorer overall survival. In summary, we provide evidence that overexpression of ENRAGE in ccRCC tumor tissue is an obesity-associated somatic alteration. Upregulation of ENRAGE could lead to local, autocrine stimulation of the RAGE receptor and thus support cancer progression.
PMCID: PMC3977147  PMID: 24374825
7.  APOBEC3B upregulation and genomic mutation patterns in serous ovarian carcinoma 
Cancer Research  2013;73(24):10.1158/0008-5472.CAN-13-1753.
Ovarian cancer is a clinically and molecularly heterogeneous disease. The driving forces behind this variability are unknown. Here we report wide variation in expression of the DNA cytosine deaminase APOBEC3B, with elevated expression in a majority of ovarian cancer cell lines (3 standard deviations above the mean of normal ovarian surface epithelial cells) and high grade primary ovarian cancers. APOBEC3B is active in the nucleus of several ovarian cancer cell lines and elicits a biochemical preference for deamination of cytosines in 5′TC dinucleotides. Importantly, examination of whole-genome sequence from 16 ovarian cancers reveals that APOBEC3B expression correlates with total mutation load as well as elevated levels of transversion mutations. In particular, high APOBEC3B expression correlates with C-to-A and C-to-G transversion mutations within 5′TC dinucleotide motifs in early-stage high grade serous ovarian cancer genomes, suggesting that APOBEC3B-catalyzed genomic uracil lesions are further processed by downstream DNA ‘repair’ enzymes including error-prone translesion polymerases. These data identify a potential role for APOBEC3B in serous ovarian cancer genomic instability.
PMCID: PMC3867573  PMID: 24154874
APOBEC3B; DNA cytosine deamination; genomic uracil; ovarian cancer; transversion mutations
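The mutation signature described in the abstract above (C-to-A and C-to-G transversions within 5′TC motifs) can be made concrete with a small classifier. This is only a sketch of the motif test, not the authors' analysis pipeline. The function name and the 0-based coordinate convention are assumptions, and mutations at G are handled by reading the opposite strand.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def is_tc_context_transversion(reference, position, alt):
    """Return True if a substitution is a C-to-A or C-to-G change in a 5'TC motif.

    reference -- reference sequence on the forward strand (string)
    position  -- 0-based index of the mutated base (assumed convention)
    alt       -- alternate base observed at that position
    """
    ref = reference[position]
    if ref == "C":
        # Forward strand: the base immediately 5' of the C must be T.
        five_prime = reference[position - 1] if position > 0 else None
        return five_prime == "T" and alt in ("A", "G")
    if ref == "G":
        # Reverse strand: G>T or G>C here is C>A or C>G on the other strand,
        # and the 5' T on that strand appears as an A immediately 3' of the G.
        three_prime = reference[position + 1] if position + 1 < len(reference) else None
        return three_prime == "A" and COMPLEMENT[alt] in ("A", "G")
    return False


print(is_tc_context_transversion("ATCGGAT", 2, "A"))
```

Counting the calls that return True across a tumor's somatic substitutions gives a per-sample tally that could then be compared with expression levels.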
8.  Genetic Alterations Associated With Progression From Pancreatic Intraepithelial Neoplasia to Invasive Pancreatic Tumor 
Gastroenterology  2013;145(5):1098-1109.e1.
Background & Aims
Increasing grade of pancreatic intraepithelial neoplasia (PanIN) has been associated with progression to pancreatic ductal adenocarcinoma (PDAC). However, the mechanisms that control progression from PanINs to PDAC are not well understood. We investigated the genetic alterations involved in this process.
Methods
Genomic DNA samples from laser-capture microdissected PDACs and adjacent PanIN2 and PanIN3 lesions from 10 patients with pancreatic cancer were analyzed by exome sequencing.
Results
Similar numbers of somatic mutations were identified in PanINs and tumors, but the mutational load varied greatly among cases. Ten of the 15 isolated PanINs shared more than 50% of somatic mutations with associated tumors. Mutations common to tumors and clonally related PanIN2 and PanIN3 lesions were identified as genes that could promote carcinogenesis. KRAS and TP53 were frequently altered in PanINs and tumors, but few other recurrently modified genes were detected. Mutations in DNA damage response genes were prevalent in all samples. Genes that encode proteins involved in gap junctions, the actin cytoskeleton, the mitogen-activated protein kinase signaling pathway, axon guidance, and cell cycle regulation were among the earliest targets of mutagenesis in PanINs that progressed to PDAC.
Conclusions
Early-stage PanIN2 lesions appear to contain many of the somatic gene alterations required for PDAC development.
PMCID: PMC3926442  PMID: 23912084
pancreas; tumorigenesis; LCM; whole genome amplification
9.  PatternCNV: a versatile tool for detecting copy number changes from exome sequencing data 
Bioinformatics  2014;30(18):2678-2680.
Motivation: Exome sequencing (exome-seq) data, which are typically used for calling exonic mutations, have also been utilized in detecting DNA copy number variations (CNVs). Despite the existence of several CNV detection tools, there is still a great need for a sensitive and accurate CNV-calling algorithm that has built-in QC steps and does not require a paired reference for each sample.
Results: We developed a novel method named PatternCNV, which (i) accounts for the read coverage variations between exons while leveraging the consistencies of this variability across different samples; (ii) reduces alignment BAM files to WIG format and therefore greatly accelerates computation; (iii) incorporates multiple QC measures designed to identify outlier samples and batch effects; and (iv) provides a variety of visualization options including chromosome, gene and exon-level views of CNVs, along with a tabular summarization of the exon-level CNVs. Compared with other CNV-calling algorithms using data from a lymphoma exome-seq study, PatternCNV has higher sensitivity and specificity.
Availability and implementation: The software for PatternCNV is implemented using Perl and R, and can be used in Mac or Linux environments. Software and user manual are available at, and R package at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC4155258  PMID: 24876377
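Point (i) of the PatternCNV abstract, using the consistency of exon-level coverage across samples as a baseline, can be illustrated with a generic log2 coverage ratio against a per-exon median. This is not the PatternCNV algorithm, only a minimal sketch of the general idea. The function name, the dictionary layout, and the pseudocount are assumptions made for illustration.

```python
from math import log2
from statistics import median


def exon_log2_ratios(sample, references, pseudocount=1.0):
    """Compare one sample's per-exon coverage with the median of reference samples.

    sample     -- dict mapping exon id -> depth-normalized coverage
    references -- list of dicts with the same keys, one per reference sample
    """
    ratios = {}
    for exon, coverage in sample.items():
        # The median across samples stands in for the exon's typical coverage level.
        typical = median(ref[exon] for ref in references)
        ratios[exon] = log2((coverage + pseudocount) / (typical + pseudocount))
    return ratios


# Illustrative values only.
refs = [{"exon1": 99, "exon2": 99} for _ in range(3)]
print(exon_log2_ratios({"exon1": 99, "exon2": 49}, refs))
```

A ratio near 0 suggests no change, while a ratio near -1 is what a single-copy loss in a diploid region would produce in this simplified setting.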
10.  The Biological Reference Repository (BioR): a rapid and flexible system for genomics annotation 
Bioinformatics  2014;30(13):1920-1922.
Motivation: The Biological Reference Repository (BioR) is a toolkit for annotating variants. BioR stores public and user-specific annotation sources in indexed JSON-encoded flat files (catalogs). The BioR toolkit provides the functionality to combine and retrieve annotation from these catalogs via the command-line interface. Several catalogs from commonly used annotation sources and instructions for creating user-specific catalogs are provided. Commands from the toolkit can be combined with other UNIX commands for advanced annotation processing. We also provide instructions for the development of custom annotation pipelines.
Availability and implementation: The package is implemented in Java and makes use of external tools written in Java and Perl. The toolkit can be executed on Mac OS X 10.5 and above or any Linux distribution. The BioR application, quickstart, and user guide documents and many biological examples are available at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC4071205  PMID: 24618464
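The BioR abstract describes catalogs as indexed, JSON-encoded flat files. The sketch below shows how such a file might be queried by position. The four-column layout (chromosome, start, end, JSON record), the function names, and the example records are assumptions for illustration only; this is not the BioR command-line interface, and a real catalog would be queried through an index rather than a linear scan.

```python
import json


def parse_catalog_line(line):
    """Split one tab-delimited catalog row into coordinates and a JSON record."""
    chrom, start, end, payload = line.rstrip("\n").split("\t", 3)
    return chrom, int(start), int(end), json.loads(payload)


def annotate(chrom, position, catalog_lines):
    """Return the JSON records whose interval contains the given position."""
    hits = []
    for line in catalog_lines:
        c, start, end, record = parse_catalog_line(line)
        if c == chrom and start <= position <= end:
            hits.append(record)
    return hits


# Hypothetical catalog rows; gene names and coordinates are placeholders.
catalog = [
    '17\t100\t200\t{"gene": "GENE_A", "note": "illustrative"}',
    '17\t300\t400\t{"gene": "GENE_B", "note": "illustrative"}',
]
print(annotate("17", 150, catalog))
```

Keeping the record as JSON in a single column lets each annotation source carry its own fields without changing the file layout.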
11.  From Days to Hours: Reporting Clinically Actionable Variants from Whole Genome Sequencing 
PLoS ONE  2014;9(2):e86803.
As the cost of whole genome sequencing (WGS) decreases, clinical laboratories will be looking at broadly adopting this technology to screen for variants of clinical significance. To fully leverage this technology in a clinical setting, results need to be reported quickly, as the turnaround rate could potentially impact patient care. The latest sequencers can sequence a whole human genome in about 24 hours. However, depending on the computing infrastructure available, the processing of data can take several days, with the majority of computing time devoted to aligning reads to genomic regions that are to date not clinically interpretable. In an attempt to accelerate the reporting of clinically actionable variants, we have investigated the utility of a multi-step alignment algorithm focused on aligning reads and calling variants in genomic regions of clinical relevance prior to processing the remaining reads on the whole genome. This iterative workflow significantly accelerates the reporting of clinically actionable variants with no loss of accuracy when compared to genotypes obtained with the OMNI SNP platform or to variants detected with a standard workflow that combines Novoalign and GATK.
PMCID: PMC3914798  PMID: 24505267
12.  SoftSearch: Integration of Multiple Sequence Features to Identify Breakpoints of Structural Variations 
PLoS ONE  2013;8(12):e83356.
Structural variation (SV) represents a significant, yet poorly understood contribution to an individual’s genetic makeup. Advanced next-generation sequencing technologies are widely used to discover such variations, but there is no single detection tool that is considered a community standard. In an attempt to fulfil this need, we developed an algorithm, SoftSearch, for discovering structural variant breakpoints in Illumina paired-end next-generation sequencing data. SoftSearch combines multiple strategies for detecting SV including split-read, discordant read-pair, and unmated pairs. Co-localized split-reads and discordant read pairs are used to refine the breakpoints.
We developed and validated SoftSearch using real and synthetic datasets. SoftSearch’s key features are 1) no requirement for secondary (or exhaustive primary) alignment, 2) portability into established sequencing workflows, and 3) applicability to any DNA-sequencing experiment (e.g. whole genome, exome, custom capture, etc.). SoftSearch identifies breakpoints from a small number of soft-clipped bases from split reads and a few discordant read-pairs which on their own would not be sufficient to make an SV call.
We show that SoftSearch can identify more true SVs by combining multiple sequence features. SoftSearch was able to call clinically relevant SVs in the BRCA2 gene not reported by other tools while offering significantly improved overall performance.
PMCID: PMC3865185  PMID: 24358278
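The SoftSearch abstract relies on soft-clipped bases to locate breakpoints. As a rough illustration of that one signal, the sketch below reads a SAM-style CIGAR string and reports where a soft clip implies a candidate breakpoint. It is not SoftSearch's implementation. The function name and the `min_clip` threshold are hypothetical, and the discordant read-pair evidence that SoftSearch combines with split reads is not modeled here.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # operations that advance along the reference


def soft_clip_breakpoints(pos, cigar, min_clip=5):
    """Return candidate breakpoint coordinates implied by soft-clipped read ends.

    pos      -- 1-based leftmost aligned reference position (as in SAM)
    cigar    -- CIGAR string, e.g. "60M40S"
    min_clip -- smallest soft clip considered informative (assumed threshold)
    """
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Hard clips carry no sequence, so drop them before inspecting the read ends.
    ops = [(n, op) for n, op in ops if op != "H"]
    breakpoints = []
    if ops and ops[0][1] == "S" and ops[0][0] >= min_clip:
        breakpoints.append(pos)  # clip on the left: break at the first aligned base
    if len(ops) > 1 and ops[-1][1] == "S" and ops[-1][0] >= min_clip:
        aligned_span = sum(n for n, op in ops if op in REF_CONSUMING)
        breakpoints.append(pos + aligned_span - 1)  # last aligned reference base
    return breakpoints


print(soft_clip_breakpoints(1000, "60M40S"))
```

Breakpoints reported by several reads at the same coordinate would be stronger candidates than any single clipped read.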
13.  Deep Sequence Analysis of Non-Small Cell Lung Cancer: Integrated Analysis of Gene Expression, Alternative Splicing, and Single Nucleotide Variations in Lung Adenocarcinomas with and without Oncogenic KRAS Mutations 
KRAS mutations are highly prevalent in non-small cell lung cancer (NSCLC), and tumors harboring these mutations tend to be aggressive and resistant to chemotherapy. We used next-generation sequencing technology to identify pathways that are specifically altered in lung tumors harboring a KRAS mutation. Paired-end RNA-sequencing of 15 primary lung adenocarcinoma tumors (8 harboring mutant KRAS and 7 with wild-type KRAS) was performed. Sequences were mapped to the human genome, and genomic features, including differentially expressed genes, alternate splicing isoforms and single nucleotide variants, were determined for tumors with and without KRAS mutation using a variety of computational methods. Network analysis was carried out on genes showing differential expression (374 genes), alternate splicing (259 genes), and SNV-related changes (65 genes) in NSCLC tumors harboring a KRAS mutation. Genes exhibiting two or more connections from the lung adenocarcinoma network were used to carry out integrated pathway analysis. The most significant signaling pathways identified through this analysis were the NFκB, ERK1/2, and AKT pathways. A 27 gene mutant KRAS-specific sub network was extracted based on gene–gene connections from the integrated network, and interrogated for druggable targets. Our results confirm previous evidence that mutant KRAS tumors exhibit activated NFκB, ERK1/2, and AKT pathways and may be preferentially sensitive to targeted therapeutics directed at these pathways. In addition, our analysis indicates novel, previously unappreciated links between mutant KRAS and the TNFR and PPARγ signaling pathways, suggesting that targeted PPARγ antagonists and TNFR inhibitors may be useful therapeutic strategies for treatment of mutant KRAS lung tumors. Our study is the first to integrate genomic features from RNA-Seq data from NSCLC and to define a first draft genomic landscape model that is unique to tumors with oncogenic KRAS mutations.
PMCID: PMC3356053  PMID: 22655260
transcriptome sequencing; RNA-Seq; KRAS mutation; NSCLC; bioinformatics; network analysis; data integration and computational methods
14.  Microarray analysis of the in vivo sequence preferences of a minor groove binding drug 
BMC Genomics  2008;9:32.
Minor groove binding drugs (MGBDs) interact with DNA in a sequence-specific manner and can cause changes in gene expression at the level of transcription. They serve as valuable models for protein interactions with DNA and form an important class of antitumor, antiviral, antitrypanosomal and antibacterial drugs. There is a need to extend knowledge of the sequence requirements for MGBDs from in vitro DNA binding studies to living cells.
Here we describe the use of microarray analysis to discover yeast genes that are affected by treatment with the MGBD berenil, thereby allowing the investigation of its sequence requirements for binding in vivo. A novel approach to sequence analysis allowed us to address hypotheses about genes that were directly or indirectly affected by drug binding. The results show that the sequence features of A/T richness and heteropolymeric character discovered by in vitro berenil binding studies are found upstream of genes hypothesized to be directly affected by berenil but not upstream of those hypothesized to be indirectly affected or those shown to be unaffected.
The data support the conclusion that effects of berenil on gene expression in yeast cells can be explained by sequence patterns discovered by in vitro binding experiments. The results shed light on the sequence and structural rules by which berenil binds to DNA and affects the transcriptional regulation of genes and contribute generally to the development of MGBDs as tools for basic and applied research.
PMCID: PMC2254601  PMID: 18215295
