
Results 1-14 (14)

1.  An international effort towards developing standards for best practices in analysis, interpretation and reporting of clinical genome sequencing results in the CLARITY Challenge 
Brownstein, Catherine A | Beggs, Alan H | Homer, Nils | Merriman, Barry | Yu, Timothy W | Flannery, Katherine C | DeChene, Elizabeth T | Towne, Meghan C | Savage, Sarah K | Price, Emily N | Holm, Ingrid A | Luquette, Lovelace J | Lyon, Elaine | Majzoub, Joseph | Neupert, Peter | McCallie Jr, David | Szolovits, Peter | Willard, Huntington F | Mendelsohn, Nancy J | Temme, Renee | Finkel, Richard S | Yum, Sabrina W | Medne, Livija | Sunyaev, Shamil R | Adzhubey, Ivan | Cassa, Christopher A | de Bakker, Paul IW | Duzkale, Hatice | Dworzyński, Piotr | Fairbrother, William | Francioli, Laurent | Funke, Birgit H | Giovanni, Monica A | Handsaker, Robert E | Lage, Kasper | Lebo, Matthew S | Lek, Monkol | Leshchiner, Ignaty | MacArthur, Daniel G | McLaughlin, Heather M | Murray, Michael F | Pers, Tune H | Polak, Paz P | Raychaudhuri, Soumya | Rehm, Heidi L | Soemedi, Rachel | Stitziel, Nathan O | Vestecka, Sara | Supper, Jochen | Gugenmus, Claudia | Klocke, Bernward | Hahn, Alexander | Schubach, Max | Menzel, Mortiz | Biskup, Saskia | Freisinger, Peter | Deng, Mario | Braun, Martin | Perner, Sven | Smith, Richard JH | Andorf, Janeen L | Huang, Jian | Ryckman, Kelli | Sheffield, Val C | Stone, Edwin M | Bair, Thomas | Black-Ziegelbein, E Ann | Braun, Terry A | Darbro, Benjamin | DeLuca, Adam P | Kolbe, Diana L | Scheetz, Todd E | Shearer, Aiden E | Sompallae, Rama | Wang, Kai | Bassuk, Alexander G | Edens, Erik | Mathews, Katherine | Moore, Steven A | Shchelochkov, Oleg A | Trapane, Pamela | Bossler, Aaron | Campbell, Colleen A | Heusel, Jonathan W | Kwitek, Anne | Maga, Tara | Panzer, Karin | Wassink, Thomas | Van Daele, Douglas | Azaiez, Hela | Booth, Kevin | Meyer, Nic | Segal, Michael M | Williams, Marc S | Tromp, Gerard | White, Peter | Corsmeier, Donald | Fitzgerald-Butt, Sara | Herman, Gail | Lamb-Thrush, Devon | McBride, Kim L | Newsom, David | Pierson, Christopher R | Rakowsky, Alexander T | Maver, Aleš | Lovrečić, Luca | Palandačić, Anja | Peterlin, Borut | 
Torkamani, Ali | Wedell, Anna | Huss, Mikael | Alexeyenko, Andrey | Lindvall, Jessica M | Magnusson, Måns | Nilsson, Daniel | Stranneheim, Henrik | Taylan, Fulya | Gilissen, Christian | Hoischen, Alexander | van Bon, Bregje | Yntema, Helger | Nelen, Marcel | Zhang, Weidong | Sager, Jason | Zhang, Lu | Blair, Kathryn | Kural, Deniz | Cariaso, Michael | Lennon, Greg G | Javed, Asif | Agrawal, Saloni | Ng, Pauline C | Sandhu, Komal S | Krishna, Shuba | Veeramachaneni, Vamsi | Isakov, Ofer | Halperin, Eran | Friedman, Eitan | Shomron, Noam | Glusman, Gustavo | Roach, Jared C | Caballero, Juan | Cox, Hannah C | Mauldin, Denise | Ament, Seth A | Rowen, Lee | Richards, Daniel R | Lucas, F Anthony San | Gonzalez-Garay, Manuel L | Caskey, C Thomas | Bai, Yu | Huang, Ying | Fang, Fang | Zhang, Yan | Wang, Zhengyuan | Barrera, Jorge | Garcia-Lobo, Juan M | González-Lamuño, Domingo | Llorca, Javier | Rodriguez, Maria C | Varela, Ignacio | Reese, Martin G | De La Vega, Francisco M | Kiruluta, Edward | Cargill, Michele | Hart, Reece K | Sorenson, Jon M | Lyon, Gholson J | Stevenson, David A | Bray, Bruce E | Moore, Barry M | Eilbeck, Karen | Yandell, Mark | Zhao, Hongyu | Hou, Lin | Chen, Xiaowei | Yan, Xiting | Chen, Mengjie | Li, Cong | Yang, Can | Gunel, Murat | Li, Peining | Kong, Yong | Alexander, Austin C | Albertyn, Zayed I | Boycott, Kym M | Bulman, Dennis E | Gordon, Paul MK | Innes, A Micheil | Knoppers, Bartha M | Majewski, Jacek | Marshall, Christian R | Parboosingh, Jillian S | Sawyer, Sarah L | Samuels, Mark E | Schwartzentruber, Jeremy | Kohane, Isaac S | Margulies, David M
Genome Biology  2014;15(3):R53.
There is tremendous potential for genome sequencing to improve clinical diagnosis and care once it becomes routinely accessible, but this will require formalizing research methods into clinical best practices in the areas of sequence data generation, analysis, interpretation and reporting. The CLARITY Challenge was designed to spur convergence in methods for diagnosing genetic disease starting from clinical case history and genome sequencing data. DNA samples were obtained from three families with heritable genetic disorders and genomic sequence data were donated by sequencing platform vendors. The challenge was to analyze and interpret these data with the goals of identifying disease-causing variants and reporting the findings in a clinically useful format. Participating contestant groups were solicited broadly, and an independent panel of judges evaluated their performance.
A total of 30 international groups were engaged. The entries reveal a general convergence of practices on most elements of the analysis and interpretation process. However, even given this commonality of approach, only two groups identified the consensus candidate variants in all disease cases, demonstrating a need for consistent fine-tuning of the generally accepted methods. There was greater diversity of the final clinical report content and in the patient consenting process, demonstrating that these areas require additional exploration and standardization.
The CLARITY Challenge provides a comprehensive assessment of current practices for using genome sequencing to diagnose and report genetic diseases. There is remarkable convergence in bioinformatic techniques, but medical interpretation and reporting are areas that require further development by many groups.
PMCID: PMC4073084  PMID: 24667040
Nature methods  2008;5(10):887-893.
We developed a generalized framework for multiplexed resequencing of targeted regions of the human genome on the Illumina Genome Analyzer using degenerate indexed DNA sequence barcodes ligated to fragmented DNA prior to sequencing. Using this method, the DNA of multiple HapMap individuals was simultaneously sequenced at several ENCODE (ENCyclopedia of DNA Elements) regions. We then evaluated the use of Bayes factors for discovering and genotyping polymorphisms from aligned sequenced reads. If we required that predicted polymorphisms be either previously identified by dbSNP or be visually evident upon reinspection of archived ENCODE traces, we observed a false-positive rate of 11.3% using strict thresholds (Ks>1,000) for predicting variants and 69.6% for lax thresholds (Ks>10). Conversely, false-negative rates ranged from 10.8% to 90.8%, with those at stricter cut-offs occurring at lower coverage (< 10 aligned reads). These results suggest that >90% of genetic variants are discoverable using multiplexed sequencing provided sufficient coverage at the polymorphic base.
PMCID: PMC3171277  PMID: 18794863
3.  A survey of sequence alignment algorithms for next-generation sequencing 
Briefings in Bioinformatics  2010;11(5):473-483.
Rapidly evolving sequencing technologies produce data on an unparalleled scale. A central challenge to the analysis of this data is sequence alignment, whereby sequence reads must be compared to a reference. A wide variety of alignment algorithms and software have been subsequently developed over the past two years. In this article, we will systematically review the current development of these algorithms and introduce their practical applications on different types of experimental data. We come to the conclusion that short-read alignment is no longer the bottleneck of data analyses. We also consider future development of alignment algorithms with respect to emerging long sequence reads and the prospect of cloud computing.
PMCID: PMC2943993  PMID: 20460430
new sequencing technologies; alignment algorithm; sequence analysis
4.  Improved variant discovery through local re-alignment of short-read next-generation sequencing data using SRMA 
Genome Biology  2010;11(10):R99.
A primary component of next-generation sequencing analysis is to align short reads to a reference genome, with each read aligned independently. However, reads that observe the same non-reference DNA sequence are highly correlated and can be used to better model the true variation in the target genome. A novel short-read micro re-aligner, SRMA, that leverages this correlation to better resolve a consensus of the underlying DNA sequence of the targeted genome is described here.
PMCID: PMC3218665  PMID: 20932289
5.  Local alignment of generalized k-base encoded DNA sequence 
BMC Bioinformatics  2010;11:347.
DNA sequence comparison is a well-studied problem, in which two DNA sequences are compared using a weighted edit distance. Recent DNA sequencing technologies however observe an encoded form of the sequence, rather than each DNA base individually. The encoded DNA sequence may contain technical errors, and therefore encoded sequencing errors must be incorporated when comparing an encoded DNA sequence to a reference DNA sequence.
Although two-base encoding is currently used in practice, many other encoding schemes are possible, whereby two or more bases are encoded at a time. A generalized k-base encoding scheme is presented, whereby feasible higher order encodings are better able to differentiate errors in the encoded sequence from true DNA sequence variants. A generalized version of the previous two-base encoding DNA sequence comparison algorithm is used to compare a k-base encoded sequence to a DNA reference sequence. Finally, simulations are performed to evaluate the power, the false positive and false negative SNP discovery rates, and the performance time of k-base encoding compared to previous methods as well as to the standard DNA sequence comparison algorithm.
The novel generalized k-base encoding scheme and resulting local alignment algorithm permit the development of higher fidelity ligation-based next generation sequencing technology. This bioinformatic solution affords greater robustness to errors, as well as lower false SNP discovery rates, only at the cost of computational time.
PMCID: PMC2911458  PMID: 20576157
6.  Statistical Comparison Framework and Visualization Scheme for Ranking-Based Algorithms in High-Throughput Genome-Wide Studies 
Journal of Computational Biology  2009;16(4):565-577.
As a first step in analyzing high-throughput data in genome-wide studies, several algorithms are available to identify and prioritize candidates lists for downstream fine-mapping. The prioritized candidates could be differentially expressed genes, aberrations in comparative genomics hybridization studies, or single nucleotide polymorphisms (SNPs) in association studies. Different analysis algorithms are subject to various experimental artifacts and analytical features that lead to different candidate lists. However, little research has been carried out to theoretically quantify the consensus between different candidate lists and to compare the study specific accuracy of the analytical methods based on a known reference candidate list. Within the context of genome-wide studies, we propose a generic mathematical framework to statistically compare ranked lists of candidates from different algorithms with each other or, if available, with a reference candidate list. To cope with the growing need for intuitive visualization of high-throughput data in genome-wide studies, we describe a complementary customizable visualization tool. As a case study, we demonstrate application of our framework to the comparison and visualization of candidate lists generated in a DNA-pooling based genome-wide association study of CEPH data in the HapMap project, where prior knowledge from individual genotyping can be used to generate a true reference candidate list. The results provide a theoretical basis to compare the accuracy of various methods and to identify redundant methods, thus providing guidance for selecting the most suitable analysis method in genome-wide studies.
PMCID: PMC3148127  PMID: 19361328
genome-wide association studies; candidate lists
7.  U87MG Decoded: The Genomic Sequence of a Cytogenetically Aberrant Human Cancer Cell Line 
PLoS Genetics  2010;6(1):e1000832.
U87MG is a commonly studied grade IV glioma cell line that has been analyzed in at least 1,700 publications over four decades. In order to comprehensively characterize the genome of this cell line and to serve as a model of broad cancer genome sequencing, we have generated greater than 30× genomic sequence coverage using a novel 50-base mate paired strategy with a 1.4kb mean insert library. A total of 1,014,984,286 mate-end and 120,691,623 single-end two-base encoded reads were generated from five slides. All data were aligned using a custom designed tool called BFAST, allowing optimal color space read alignment and accurate identification of DNA variants. The aligned sequence reads and mate-pair information identified 35 interchromosomal translocation events, 1,315 structural variations (>100 bp), 191,743 small (<21 bp) insertions and deletions (indels), and 2,384,470 single nucleotide variations (SNVs). Among these observations, the known homozygous mutation in PTEN was robustly identified, and genes involved in cell adhesion were overrepresented in the mutated gene list. Data were compared to 219,187 heterozygous single nucleotide polymorphisms assayed by Illumina 1M Duo genotyping array to assess accuracy: 93.83% of all SNPs were reliably detected at filtering thresholds that yield greater than 99.99% sequence accuracy. Protein coding sequences were disrupted predominantly in this cancer cell line due to small indels, large deletions, and translocations. In total, 512 genes were homozygously mutated, including 154 by SNVs, 178 by small indels, 145 by large microdeletions, and 35 by interchromosomal translocations to reveal a highly mutated cell line genome. Of the small homozygously mutated variants, 8 SNVs and 99 indels were novel events not present in dbSNP. These data demonstrate that routine generation of broad cancer genome sequence is possible outside of genome centers. 
The sequence analysis of U87MG provides an unparalleled level of mutational resolution compared to any cell line to date.
Author Summary
Glioblastoma has a particularly dismal prognosis with median survival time of less than fifteen months. Here, we describe the broad genome sequencing of U87MG, a commonly used and thus well-studied glioblastoma cell line. One of the major features of the U87MG genome is the large number of chromosomal abnormalities, which can be typical of cancer cell lines and primary cancers. The systematic, thorough, and accurate mutational analysis of the U87MG genome comprehensively identifies different classes of genetic mutations including single-nucleotide variations (SNVs), insertions/deletions (indels), and translocations. We found 2,384,470 SNVs, 191,743 small indels, and 1,314 large structural variations. Known gene models were used to predict the effect of these mutations on protein-coding sequence. Mutational analysis revealed 512 genes homozygously mutated, including 154 by SNVs, 178 by small indels, 145 by large microdeletions, and up to 35 by interchromosomal translocations. The major mutational mechanisms in this brain cancer cell line are small indels and large structural variations. The genomic landscape of U87MG is revealed to be much more complex than previously thought based on lower resolution techniques. This mutational analysis serves as a resource for past and future studies on U87MG, informing them with a thorough description of its mutational state.
PMCID: PMC2813426  PMID: 20126413
8.  Improving the efficiency of genomic loci capture using oligonucleotide arrays for high throughput resequencing 
BMC Genomics  2009;10:646.
The emergence of next-generation sequencing technology presents tremendous opportunities to accelerate the discovery of rare variants or mutations that underlie human genetic disorders. Although the complete sequencing of the affected individuals' genomes would be the most powerful approach to finding such variants, the cost of such efforts makes it impractical for routine use in disease gene research. In cases where candidate genes or loci can be defined by linkage, association, or phenotypic studies, the practical sequencing target can be made much smaller than the whole genome, and it becomes critical to have capture methods that can be used to purify the desired portion of the genome for shotgun short-read sequencing without biasing allelic representation or coverage. One major approach is array-based capture, which relies on the ability to create a custom in-situ synthesized oligonucleotide microarray for use as a collection of hybridization capture probes. This approach is being used by our group and others routinely and we are continuing to improve its performance.
Here, we provide a complete protocol optimized for large aggregate sequence intervals and demonstrate its utility with the capture of all predicted amino acid coding sequence from 3,038 human genes using 241,700 60-mer oligonucleotides. Further, we demonstrate two techniques by which the efficiency of the capture can be increased: by introducing a step to block cross hybridization mediated by common adapter sequences used in sequencing library construction, and by repeating the hybridization capture step. These improvements can boost the targeting efficiency to the point where over 85% of the mapped sequence reads fall within 100 bases of the targeted regions.
The complete protocol introduced in this paper enables researchers to perform practical capture experiments, and includes two novel methods for increasing the targeting efficiency. Coupled with the new massively parallel sequencing technologies, this provides a powerful approach to identifying disease-causing genetic variants that can be localized within the genome by traditional methods.
PMCID: PMC2808330  PMID: 20043857
9.  BFAST: An Alignment Tool for Large Scale Genome Resequencing 
PLoS ONE  2009;4(11):e7767.
The new generation of massively parallel DNA sequencers, combined with the challenge of whole human genome resequencing, results in the need for rapid and accurate alignment of billions of short DNA sequence reads to a large reference genome. Speed is obviously of great importance, but equally important is maintaining alignment accuracy of short reads, in the 25–100 base range, in the presence of errors and true biological variation.
We introduce a new algorithm specifically optimized for this task, as well as a freely available implementation, BFAST, which can align data produced by any of the current sequencing platforms, allows for user-customizable levels of speed and accuracy, supports paired end data, and provides for efficient parallel and multi-threaded computation on a computer cluster. The new method is based on creating flexible, efficient whole genome indexes to rapidly map reads to candidate alignment locations, with arbitrary multiple independent indexes allowed to achieve robustness against read errors and sequence variants. The final local alignment uses a Smith-Waterman method, with gaps to support the detection of small indels.
We compare BFAST to a selection of large-scale alignment tools - BLAT, MAQ, SHRiMP, and SOAP - in terms of both speed and accuracy, using simulated and real-world datasets. We show BFAST can achieve substantially greater sensitivity of alignment in the context of errors and true variants, especially insertions and deletions, and minimize false mappings, while maintaining adequate speed compared to other current methods. We show BFAST can align the amount of data needed to fully resequence a human genome, one billion reads, with high sensitivity and accuracy, on a modest computer cluster in less than 24 hours. BFAST is available at
PMCID: PMC2770639  PMID: 19907642
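The abstract above mentions a final Smith-Waterman local alignment step. The following is a minimal textbook sketch of Smith-Waterman scoring, not BFAST's implementation: the function name `smith_waterman`, the linear gap penalty, and the scoring values (match 2, mismatch -1, gap -2) are assumptions chosen only for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Illustrative only: uses a linear gap penalty and hypothetical
    scoring values; real aligners typically use affine gaps.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] holds the best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] + gap      # gap in b (a[i-1] left unpaired)
            left = H[i][j - 1] + gap    # gap in a (b[j-1] left unpaired)
            # Clamping at zero is what makes the alignment local.
            H[i][j] = max(0, diag, up, left)
            best = max(best, H[i][j])
    return best
```

Calling `smith_waterman(read, reference_window)` on a read and a candidate reference window returns a score that can be compared across candidate locations; allowing gaps is what lets such a step tolerate small insertions and deletions.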
10.  Common sequence variants on 20q11.22 confer melanoma susceptibility 
Nature genetics  2008;40(7):838-840.
We conducted a genome-wide association pooling study for cutaneous melanoma and performed validation in samples totalling 2019 cases and 2105 controls. Using pooling we identified a novel melanoma risk locus on chromosome 20 (rs910873, rs1885120), with replication in two further samples (combined P < 1 × 10^-15). The odds ratio is 1.75 (1.53, 2.01), with evidence for stronger association in early onset cases.
PMCID: PMC2755512  PMID: 18488026
11.  Multimarker analysis and imputation of multiple platform pooling-based genome-wide association studies 
Bioinformatics  2008;24(17):1896-1902.
Summary: For many genome-wide association (GWA) studies, individually genotyping one million or more SNPs provides a marginal increase in coverage at a substantial cost. Much of the information gained is redundant due to the correlation structure inherent in the human genome. Pooling-based GWA studies could benefit significantly by utilizing this redundancy to reduce noise, improve the accuracy of the observations and increase genomic coverage. We introduce a measure of correlation between individual genotyping and pooling, under the same framework that r² provides a measure of linkage disequilibrium (LD) between pairs of SNPs. We then report a new non-haplotype multimarker multi-loci method that leverages the correlation structure between SNPs in the human genome to increase the efficacy of pooling-based GWA studies. We first give a theoretical framework and derivation of our multimarker method. Next, we evaluate simulations using this multimarker approach in comparison to single marker analysis. Finally, we experimentally evaluate our method using different pools of HapMap individuals on the Illumina 450S Duo, Illumina 550K and Affymetrix 5.0 platforms for a combined total of 1 333 631 SNPs. Our results show that use of multimarker analysis reduces noise specific to pooling-based studies, allows for efficient integration of multiple microarray platforms and provides more accurate measures of significance than single marker analysis. Additionally, this approach can be extended to allow for imputing the association significance for SNPs not directly observed using neighboring SNPs in LD. This multimarker method can now be used to cost-effectively complete pooling-based GWA studies with multiple platforms across over one million SNPs and to impute neighboring SNPs weighted for the loss of information due to pooling.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2732219  PMID: 18617537
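For readers unfamiliar with the LD measure the abstract above refers to, here is a small sketch of the standard pairwise r² computed from haplotype and allele frequencies. It illustrates only the conventional definition; it is not the pooling correlation measure or the multimarker method introduced in the paper, and the function name `ld_r2` and its parameters are hypothetical.

```python
def ld_r2(p_ab, p_a, p_b):
    """Standard pairwise LD r^2 between two biallelic SNPs.

    p_ab: frequency of the haplotype carrying allele A at SNP 1 and allele B at SNP 2
    p_a:  frequency of allele A at SNP 1
    p_b:  frequency of allele B at SNP 2
    """
    # D measures the deviation of the haplotype frequency from independence.
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when either SNP is monomorphic")
    return d * d / denom
```

An r² of 1 means one SNP fully predicts the other, which is the kind of redundancy between neighboring SNPs that the abstract describes exploiting; an r² of 0 means the two SNPs carry independent information.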
12.  Local alignment of two-base encoded DNA sequence 
BMC Bioinformatics  2009;10:175.
DNA sequence comparison is based on optimal local alignment of two sequences using a similarity score. However, some new DNA sequencing technologies do not directly measure the base sequence, but rather an encoded form, such as the two-base encoding considered here. In order to compare such data to a reference sequence, the data must be decoded into sequence. The decoding is deterministic, but the possibility of measurement errors requires searching among all possible error modes and resulting alignments to achieve an optimal balance of fewer errors versus greater sequence similarity.
We present an extension of the standard dynamic programming method for local alignment, which simultaneously decodes the data and performs the alignment, maximizing a similarity score based on a weighted combination of errors and edits, and allowing an affine gap penalty. We also present simulations that demonstrate the performance characteristics of our two base encoded alignment method and contrast those with standard DNA sequence alignment under the same conditions.
The new local alignment algorithm for two-base encoded data has substantial power to properly detect and correct measurement errors while identifying underlying sequence variants, and facilitating genome re-sequencing efforts based on this form of sequence data.
PMCID: PMC2709925  PMID: 19508732
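To make the idea of two-base encoding concrete, the sketch below encodes a base sequence into "colors" and decodes it again. It assumes the common color-space convention in which bases are numbered A=0, C=1, G=2, T=3 and each color is the XOR of two adjacent base codes; the function names are hypothetical, and this is not the paper's alignment algorithm, which decodes and aligns simultaneously.

```python
BASES = "ACGT"  # assumed numbering: A=0, C=1, G=2, T=3


def encode_colors(seq):
    """Encode a DNA string as a list of colors, one per adjacent base pair."""
    codes = [BASES.index(base) for base in seq]
    return [x ^ y for x, y in zip(codes, codes[1:])]


def decode_colors(first_base, colors):
    """Decode colors back to bases, given the known first base."""
    current = BASES.index(first_base)
    out = [first_base]
    for color in colors:
        # Each color, combined with the previous base, determines the next base.
        current ^= color
        out.append(BASES[current])
    return "".join(out)
```

Because each decoded base depends on all preceding colors, a single measurement error changes every base downstream of it when decoding naively: decoding `[1, 3, 1]` from `A` gives `ACGT`, while the one-color error `[1, 0, 1]` gives `ACCA`. This is why the decoding and the alignment have to be considered together, as the abstract above describes.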
13.  The Sequence Alignment/Map format and SAMtools 
Bioinformatics  2009;25(16):2078-2079.
Summary: The Sequence Alignment/Map (SAM) format is a generic alignment format for storing read alignments against reference sequences, supporting short and long reads (up to 128 Mbp) produced by different sequencing platforms. It is flexible in style, compact in size, efficient in random access and is the format in which alignments from the 1000 Genomes Project are released. SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, variant caller and alignment viewer, and thus provides universal tools for processing read alignments.
PMCID: PMC2723002  PMID: 19505943
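As a brief illustration of the format described above, the sketch below parses one tab-separated SAM alignment line into its eleven mandatory fields plus optional tags. The class and function names are hypothetical, and this hand-rolled parser is only a teaching aid; it is not SAMtools and skips header lines and most validation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamRecord:
    qname: str   # query (read) name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # CIGAR string
    rnext: str   # reference name of the mate/next read
    pnext: int   # position of the mate/next read
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities
    tags: tuple  # optional TAG:TYPE:VALUE fields, kept as raw strings

    @property
    def is_unmapped(self):
        return bool(self.flag & 0x4)

    @property
    def is_reverse(self):
        return bool(self.flag & 0x10)


def parse_sam_line(line):
    """Parse a single SAM alignment line (not a header line starting with '@')."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], tuple(fields[11:]),
    )
```

For example, parsing `"read1\t16\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0"` yields a record on `chr1` at position 100 whose flag 16 marks a reverse-strand alignment.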
14.  Resolving Individuals Contributing Trace Amounts of DNA to Highly Complex Mixtures Using High-Density SNP Genotyping Microarrays 
PLoS Genetics  2008;4(8):e1000167.
We use high-density single nucleotide polymorphism (SNP) genotyping microarrays to demonstrate the ability to accurately and robustly determine whether individuals are in a complex genomic DNA mixture. We first develop a theoretical framework for detecting an individual's presence within a mixture, then show, through simulations, the limits associated with our method, and finally demonstrate experimentally the identification of the presence of genomic DNA of specific individuals within a series of highly complex genomic mixtures, including mixtures where an individual contributes less than 0.1% of the total genomic DNA. These findings shift the perceived utility of SNPs for identifying individual trace contributors within a forensics mixture, and suggest future research efforts into assessing the viability of previously sub-optimal DNA sources due to sample contamination. These findings also suggest that composite statistics across cohorts, such as allele frequency or genotype counts, do not mask identity within genome-wide association studies. The implications of these findings are discussed.
Author Summary
In this report we describe a framework for accurately and robustly resolving whether individuals are in a complex genomic DNA mixture using high-density single nucleotide polymorphism (SNP) genotyping microarrays. We develop a theoretical framework for detecting an individual's presence within a mixture, show its limits through simulation, and finally demonstrate experimentally the identification of the presence of genomic DNA of individuals within a series of highly complex genomic mixtures. Our approaches demonstrate straightforward identification of trace amounts (<1%) of DNA from an individual contributor within a complex mixture. We show how probe-intensity analysis of high-density SNP data can be used, even given the experimental noise of a microarray. We discuss the implications of these findings in two fields: forensics and genome-wide association (GWA) genetic studies. Within forensics, resolving whether an individual is contributing trace amounts of genomic DNA to a complex mixture is a tremendous challenge. Within GWA studies, there is a considerable push to make experimental data publicly available so that the data can be combined with other studies. Our findings show that such an approach does not completely conceal identity, since it is straightforward to assess the probability that a person or relative participated in a GWA study.
PMCID: PMC2516199  PMID: 18769715
