1.  Statistical Comparison Framework and Visualization Scheme for Ranking-Based Algorithms in High-Throughput Genome-Wide Studies 
Journal of Computational Biology  2009;16(4):565-577.
As a first step in analyzing high-throughput data in genome-wide studies, several algorithms are available to identify and prioritize candidate lists for downstream fine-mapping. The prioritized candidates could be differentially expressed genes, aberrations in comparative genomic hybridization studies, or single nucleotide polymorphisms (SNPs) in association studies. Different analysis algorithms are subject to various experimental artifacts and analytical features that lead to different candidate lists. However, little research has been carried out to theoretically quantify the consensus between different candidate lists and to compare the study-specific accuracy of the analytical methods based on a known reference candidate list. Within the context of genome-wide studies, we propose a generic mathematical framework to statistically compare ranked lists of candidates from different algorithms with each other or, if available, with a reference candidate list. To cope with the growing need for intuitive visualization of high-throughput data in genome-wide studies, we describe a complementary customizable visualization tool. As a case study, we demonstrate application of our framework to the comparison and visualization of candidate lists generated in a DNA-pooling-based genome-wide association study of CEPH data in the HapMap project, where prior knowledge from individual genotyping can be used to generate a true reference candidate list. The results provide a theoretical basis to compare the accuracy of various methods and to identify redundant methods, thus providing guidance for selecting the most suitable analysis method in genome-wide studies.
PMCID: PMC3148127  PMID: 19361328
genome-wide association studies; candidate lists
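As a rough illustration of what comparing ranked candidate lists can involve, the sketch below counts how many candidates two lists share in their top k positions and asks how surprising that overlap would be under random selection. This is not the framework proposed in the article above; the function names, the hypergeometric null model, and the SNP identifiers are assumptions chosen only for illustration.

```python
from math import comb


def top_k_overlap(list_a, list_b, k):
    """Count candidates that appear in the top k of both ranked lists."""
    return len(set(list_a[:k]) & set(list_b[:k]))


def overlap_p_value(overlap, k, n_total):
    """Hypergeometric tail: P(shared >= overlap) when two top-k sets are
    drawn independently at random from n_total candidates.
    (Assumed null model, not taken from the article.)"""
    denom = comb(n_total, k)
    tail = sum(
        comb(k, i) * comb(n_total - k, k - i)
        for i in range(overlap, k + 1)
    )
    return tail / denom


# Hypothetical ranked SNP lists produced by two analysis methods.
method_a = ["rs1", "rs2", "rs3", "rs4", "rs5"]
method_b = ["rs2", "rs1", "rs5", "rs4", "rs3"]

shared = top_k_overlap(method_a, method_b, k=2)
p_value = overlap_p_value(shared, k=2, n_total=5)
print(shared, p_value)
```

Replacing one of the two lists with a reference list gives the same calculation against a known truth, which is the second kind of comparison the abstract mentions.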
2.  Open-access synthetic spike-in mRNA-seq data for cancer gene fusions 
BMC Genomics  2014;15(1):824.
Oncogenic fusion genes underlie the mechanism of several common cancers. RNA-seq analyses based on next-generation sequencing have revealed an increasing number of recurrent fusions in a variety of cancers. However, the absence of a publicly available, gene-fusion-focused RNA-seq dataset impedes comparative assessment and collaborative development of novel gene-fusion detection algorithms. We have generated nine synthetic poly-adenylated RNA transcripts that correspond to previously reported oncogenic gene fusions. These synthetic RNAs were spiked at known molarity over a wide range into total RNA prior to construction of next-generation sequencing mRNA libraries to generate RNA-seq data.
Leveraging a priori knowledge about replicates and molarity of each synthetic fusion transcript, we demonstrate the utility of this dataset for comparing the detection ability of multiple gene-fusion algorithms. In general, more fusions are detected at higher molarity, indicating that our constructs performed as expected. However, systematic detection differences are observed based on molarity or algorithm-specific characteristics. Fusion-sequence-specific detection differences indicate that for applications where specific sequences are being investigated, additional constructs may be added to provide quantitative data that are specific to the sequence of interest.
To our knowledge, this is the first publicly available synthetic RNA-seq dataset that specifically leverages known cancer gene fusions. The proposed method of designing multiple gene-fusion constructs over a wide range of molarity allows granular performance analyses of multiple fusion-detection algorithms. The community can leverage and augment this publicly available data to further collaborative development of analytical tools and performance assessment frameworks for gene fusions from next-generation sequencing data.
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2164-15-824) contains supplementary material, which is available to authorized users.
PMCID: PMC4190330  PMID: 25266161
RNA-seq; Gene fusions; Cancer genomics
3.  Germline Mutations in HOXB13 and Prostate-Cancer Risk 
The New England journal of medicine  2012;366(2):141-149.
Family history is a significant risk factor for prostate cancer, although the molecular basis for this association is poorly understood. Linkage studies have implicated chromosome 17q21-22 as a possible location of a prostate-cancer susceptibility gene.
We screened more than 200 genes in the 17q21-22 region by sequencing germline DNA from 94 unrelated patients with prostate cancer from families selected for linkage to the candidate region. We tested family members, additional case subjects, and control subjects to characterize the frequency of the identified mutations.
Probands from four families were discovered to have a rare but recurrent mutation (G84E) in HOXB13 (rs138213197), a homeobox transcription factor gene that is important in prostate development. All 18 men with prostate cancer and available DNA in these four families carried the mutation. The carrier rate of the G84E mutation was increased by a factor of approximately 20 in 5083 unrelated subjects of European descent who had prostate cancer, with the mutation found in 72 subjects (1.4%), as compared with 1 in 1401 control subjects (0.1%) (P = 8.5×10⁻⁷). The mutation was significantly more common in men with early-onset, familial prostate cancer (3.1%) than in those with late-onset, nonfamilial prostate cancer (0.6%) (P = 2.0×10⁻⁶).
The novel HOXB13 G84E variant is associated with a significantly increased risk of hereditary prostate cancer. Although the variant accounts for a small fraction of all prostate cancers, this finding has implications for prostate-cancer risk assessment and may provide new mechanistic insights into this common cancer. (Funded by the National Institutes of Health and others.)
PMCID: PMC3779870  PMID: 22236224
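The carrier-rate figures quoted in the HOXB13 abstract above can be checked with simple arithmetic. The sketch below assumes that "1 in 1401 control subjects" means one carrier among 1401 controls; the variable names are illustrative, and the odds ratio is an additional summary that the abstract itself does not report.

```python
# Counts as quoted in the abstract (interpretation of "1 in 1401" is assumed).
cases_carriers, cases_total = 72, 5083
controls_carriers, controls_total = 1, 1401

case_rate = cases_carriers / cases_total            # about 1.4%
control_rate = controls_carriers / controls_total   # about 0.1% after rounding
rate_ratio = case_rate / control_rate               # "factor of approximately 20"

# Odds ratio from the same 2x2 table (not reported in the abstract).
odds_ratio = (cases_carriers / (cases_total - cases_carriers)) / (
    controls_carriers / (controls_total - controls_carriers)
)

print(f"case carrier rate:    {case_rate:.1%}")
print(f"control carrier rate: {control_rate:.1%}")
print(f"rate ratio:           {rate_ratio:.1f}")
print(f"odds ratio:           {odds_ratio:.1f}")
```

Both ratios come out close to 20, which agrees with the abstract's statement that the carrier rate was increased by a factor of approximately 20.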
4.  Paired Tumor and Normal Whole Genome Sequencing of Metastatic Olfactory Neuroblastoma 
PLoS ONE  2012;7(5):e37029.
Olfactory neuroblastoma (ONB) is a rare cancer of the sinonasal tract with little molecular characterization. We performed whole genome sequencing (WGS) on paired normal and tumor DNA from a patient with metastatic ONB to identify the somatic alterations that might be drivers of tumorigenesis and/or metastatic progression.
Methodology/Principal Findings
Genomic DNA was isolated from fresh frozen tissue from a metastatic lesion and whole blood, followed by WGS at >30X depth, alignment and mapping, and mutation analyses. Sanger sequencing was used to confirm selected mutations. Sixty-two somatic short nucleotide variants (SNVs) and five deletions were identified inside coding regions, each causing a non-synonymous DNA sequence change. We selected seven SNVs and validated them by Sanger sequencing. In the metastatic ONB samples collected several months prior to WGS, all seven mutations were present. However, in the original surgical resection specimen (prior to evidence of metastatic disease), mutations in KDR, MYC, SIN3B, and NLRC4 genes were not present, suggesting that these were acquired with disease progression and/or as a result of post-treatment effects.
This work provides insight into the evolution of ONB cancer cells and provides a window into the more complex factors, including tumor clonality and multiple driver mutations.
PMCID: PMC3359355  PMID: 22649506
5.  GRM7 variants confer susceptibility to age-related hearing impairment 
Human Molecular Genetics  2008;18(4):785-796.
Age-related hearing impairment (ARHI), or presbycusis, is the most prevalent sensory impairment in the elderly. ARHI is a complex disease caused by an interaction between environmental and genetic factors. Here we describe the results of the first whole genome association study for ARHI. The study was performed using 846 cases and 846 controls selected from 3434 individuals collected by eight centers in six European countries. DNA pools for cases and controls were allelotyped on the Affymetrix 500K GeneChip® for each center separately. The 252 top-ranked single nucleotide polymorphisms (SNPs) identified in a non-Finnish European sample group (1332 samples) and the 177 top-ranked SNPs from a Finnish sample group (360 samples) were confirmed using individual genotyping. Subsequently, the 23 most interesting SNPs were individually genotyped in an independent European replication group (138 samples). This resulted in the identification of a highly significant and replicated SNP located in GRM7, the gene encoding metabotropic glutamate receptor type 7. Also in the Finnish sample group, two GRM7 SNPs were significant, albeit in a different region of the gene. As the Finnish are genetically distinct from the rest of the European population, this may be due to allelic heterogeneity. We performed histochemical studies in human and mouse and showed that mGluR7 is expressed in hair cells and in spiral ganglion cells of the inner ear. Together these data indicate that common alleles of GRM7 contribute to an individual's risk of developing ARHI, possibly through a mechanism of altered susceptibility to glutamate excitotoxicity.
PMCID: PMC2638831  PMID: 19047183
6.  Multimarker analysis and imputation of multiple platform pooling-based genome-wide association studies 
Bioinformatics  2008;24(17):1896-1902.
Summary: For many genome-wide association (GWA) studies, individually genotyping one million or more SNPs provides a marginal increase in coverage at a substantial cost. Much of the information gained is redundant due to the correlation structure inherent in the human genome. Pooling-based GWA studies could benefit significantly by utilizing this redundancy to reduce noise, improve the accuracy of the observations and increase genomic coverage. We introduce a measure of correlation between individual genotyping and pooling, under the same framework that r² provides a measure of linkage disequilibrium (LD) between pairs of SNPs. We then report a new non-haplotype multimarker multi-loci method that leverages the correlation structure between SNPs in the human genome to increase the efficacy of pooling-based GWA studies. We first give a theoretical framework and derivation of our multimarker method. Next, we evaluate simulations using this multimarker approach in comparison to single marker analysis. Finally, we experimentally evaluate our method using different pools of HapMap individuals on the Illumina 450S Duo, Illumina 550K and Affymetrix 5.0 platforms for a combined total of 1 333 631 SNPs. Our results show that use of multimarker analysis reduces noise specific to pooling-based studies, allows for efficient integration of multiple microarray platforms and provides more accurate measures of significance than single marker analysis. Additionally, this approach can be extended to allow for imputing the association significance for SNPs not directly observed using neighboring SNPs in LD. This multimarker method can now be used to cost-effectively complete pooling-based GWA studies with multiple platforms across over one million SNPs and to impute neighboring SNPs weighted for the loss of information due to pooling.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2732219  PMID: 18617537
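For readers unfamiliar with the LD measure the abstract above refers to, the sketch below computes the standard pairwise r² between two biallelic SNPs from haplotype and allele frequencies. It shows only this textbook quantity, not the pooling correlation measure or the multimarker method introduced in the article; the function name and example frequencies are assumptions.

```python
def ld_r_squared(p_ab, p_a, p_b):
    """Standard LD r^2 between two biallelic SNPs.

    p_ab: frequency of the haplotype carrying allele A at SNP 1 and B at SNP 2
    p_a:  frequency of allele A at SNP 1
    p_b:  frequency of allele B at SNP 2
    """
    d = p_ab - p_a * p_b  # disequilibrium coefficient D
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


# Hypothetical frequencies: complete LD (A always occurs with B) ...
print(ld_r_squared(p_ab=0.3, p_a=0.3, p_b=0.3))
# ... and independence (haplotype frequency equals the product of allele frequencies).
print(ld_r_squared(p_ab=0.09, p_a=0.3, p_b=0.3))
```

An r² near 1 means one SNP carries almost all of the information in the other, which is the redundancy the abstract describes exploiting to reduce pooling noise.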

