PMCC

1.  Assessment of transcript reconstruction methods for RNA-seq 
Nature methods  2013;10(12):10.1038/nmeth.2714.
RNA sequencing (RNA-seq) is transforming genome biology, enabling comprehensive transcriptome profiling with unprecedented accuracy and detail. Due to technical limitations of current high-throughput sequencing platforms, transcript identity, structure and expression level must be inferred programmatically from partial sequence reads of fragmented gene products. We evaluated 24 protocol variants of 14 independent computational methods for exon identification, transcript reconstruction and expression level quantification from RNA-seq data. Our results show that most algorithms are able to identify discrete transcript components with high success rates, but that assembly of complete isoform structures poses a major challenge even when all constituent elements are identified. Expression level estimates also varied widely across methods, even when based on similar transcript models. Consequently, the complexity of higher eukaryotic genomes imposes severe limitations in transcript recall and splice product discrimination that are likely to remain limiting factors for the analysis of current-generation RNA-seq data.
doi:10.1038/nmeth.2714
PMCID: PMC3851240  PMID: 24185837
2.  Systematic evaluation of spliced alignment programs for RNA-seq data 
Nature methods  2013;10(12):1185-1191.
High-throughput RNA sequencing is an increasingly accessible method for studying gene structure and activity on a genome-wide scale. A critical step in RNA-seq data analysis is the alignment of partial transcript reads to a reference genome sequence. To assess the performance of current mapping software, we invited developers of RNA-seq aligners to process four large human and mouse RNA-seq data sets. In total, we compared 26 mapping protocols based on 11 programs and pipelines and found major performance differences between methods on numerous benchmarks, including alignment yield, basewise accuracy, mismatch and gap placement, exon junction discovery and suitability of alignments for transcript reconstruction. We observed concordant results on real and simulated RNA-seq data, confirming the relevance of the metrics employed. Future developments in RNA-seq alignment methods would benefit from improved placement of multimapped reads, balanced utilization of existing gene annotation and a reduced false discovery rate for splice junctions.
doi:10.1038/nmeth.2722
PMCID: PMC4018468  PMID: 24185836
3.  A Comparison of Peak Callers Used for DNase-Seq Data 
PLoS ONE  2014;9(5):e96303.
Genome-wide profiling of open chromatin regions using DNase I and high-throughput sequencing (DNase-seq) is an increasingly popular approach for finding and studying regulatory elements. A variety of algorithms have been developed to identify regions of open chromatin from raw sequence-tag data, which has motivated us to assess and compare their performance. In this study, four published, publicly available peak calling algorithms used for DNase-seq data analysis (F-seq, Hotspot, MACS and ZINBA) are assessed at a range of signal thresholds on two published DNase-seq datasets for three cell types. The results were benchmarked against an independent dataset of regulatory regions derived from ENCODE in vivo transcription factor binding data for each particular cell type. The level of overlap between peak regions reported by each algorithm and this ENCODE-derived reference set was used to assess sensitivity and specificity of the algorithms. Our study suggests that F-seq has a slightly higher sensitivity than the next best algorithms. Hotspot and the ChIP-seq oriented method, MACS, both perform competitively when used with their default parameters. However, the generic peak finder ZINBA appears to be less sensitive than the other three. We also assess accuracy of each algorithm over a range of signal thresholds. In particular, we show that the accuracy of F-seq can be considerably improved by using a threshold setting that is different from the default value.
doi:10.1371/journal.pone.0096303
PMCID: PMC4014496  PMID: 24810143
4.  Low-Bandwidth and Non-Compute Intensive Remote Identification of Microbes from Raw Sequencing Reads 
PLoS ONE  2013;8(12):e83784.
Cheap DNA sequencing may soon become routine not only for human genomes but also for practically anything requiring the identification of living organisms from their DNA: tracking of infectious agents, control of food products, bioreactors, or environmental samples. We propose a novel general approach to the analysis of sequencing data where a reference genome does not have to be specified. Using a distributed architecture we are able to query a remote server for hints about what the reference might be, transferring a relatively small amount of data. Our system consists of a server with known reference DNA indexed, and a client with raw sequencing reads. The client sends a sample of unidentified reads, and in return receives a list of matching references. Sequences for the references can be retrieved and used for exhaustive computation on the reads, such as alignment. To demonstrate this approach we have implemented a web server, indexing tens of thousands of publicly available genomes and genomic regions from various organisms and returning lists of matching hits from query sequencing reads. We have also implemented two clients: one running in a web browser, and one as a python script. Both are able to handle a large number of sequencing reads, run on portable devices (the browser-based client running on a tablet), perform their task within seconds, and consume an amount of bandwidth compatible with mobile broadband networks. Such client-server approaches could develop in the future, allowing a fully automated processing of sequencing data and routine instant quality check of sequencing runs from desktop sequencers. Web access is available at http://tapir.cbs.dtu.dk. The source code for a python command-line client, a server, and supplementary data are available at http://bit.ly/1aURxkc.
doi:10.1371/journal.pone.0083784
PMCID: PMC3877093  PMID: 24391826
5.  Ensembl 2014 
Nucleic Acids Research  2013;42(D1):D749-D755.
Ensembl (http://www.ensembl.org) creates tools and data resources to facilitate genomic analysis in chordate species with an emphasis on human, major vertebrate model organisms and farm animals. Over the past year we have increased the number of species that we support to 77 and expanded our genome browser with a new scrollable overview and improved variation and phenotype views. We also report updates to our core datasets and improvements to our gene homology relationships from the addition of new species. Our REST service has been extended with additional support for comparative genomics and ontology information. Finally, we provide updated information about our methods for data access and resources for user training.
doi:10.1093/nar/gkt1196
PMCID: PMC3964975  PMID: 24316576
6.  Current status and new features of the Consensus Coding Sequence database  
Nucleic Acids Research  2013;42(D1):D865-D872.
The Consensus Coding Sequence (CCDS) project (http://www.ncbi.nlm.nih.gov/CCDS/) is a collaborative effort to maintain a dataset of protein-coding regions that are identically annotated on the human and mouse reference genome assemblies by the National Center for Biotechnology Information (NCBI) and Ensembl genome annotation pipelines. Identical annotations that pass quality assurance tests are tracked with a stable identifier (CCDS ID). Members of the collaboration, who are from NCBI, the Wellcome Trust Sanger Institute and the University of California Santa Cruz, provide coordinated and continuous review of the dataset to ensure high-quality CCDS representations. We describe here the current status and recent growth in the CCDS dataset, as well as recent changes to the CCDS web and FTP sites. These changes include more explicit reporting about the NCBI and Ensembl annotation releases being compared, new search and display options, the addition of biologically descriptive information and our approach to representing genes for which support evidence is incomplete. We also present a summary of recent and future curation targets.
doi:10.1093/nar/gkt1059
PMCID: PMC3965069  PMID: 24217909
7.  Chromatin Accessibility Data Sets Show Bias Due to Sequence Specificity of the DNase I Enzyme 
PLoS ONE  2013;8(7):e69853.
Background
DNase I is an enzyme which cuts duplex DNA at a rate that depends strongly upon its chromatin environment. In combination with high-throughput sequencing (HTS) technology, it can be used to infer genome-wide landscapes of open chromatin regions. Using this technology, systematic identification of hundreds of thousands of DNase I hypersensitive sites (DHS) per cell type has been possible, and this in turn has helped to precisely delineate genomic regulatory compartments. However, to date there has been relatively little investigation into possible biases affecting this data.
Results
We report a significant degree of sequence preference spanning sites cut by DNase I in a number of published data sets. The two major protocols in current use each show a different pattern, but for a given protocol the pattern of sequence specificity seems to be quite consistent. The patterns are substantially different from biases seen in other types of HTS data sets, and in some cases the most constrained position lies outside the sequenced fragment, implying that this constraint must relate to the digestion process rather than events occurring during library preparation or sequencing.
Conclusions
DNase I is a sequence-specific enzyme, with a specificity that may depend on experimental conditions. This sequence specificity is not taken into account by existing pipelines for identifying open chromatin regions. Care must be taken when interpreting DNase I results, especially when looking at the precise locations of the reads. Future studies may be able to improve the sensitivity and precision of chromatin state measurement by compensating for sequence bias.
doi:10.1371/journal.pone.0069853
PMCID: PMC3724795  PMID: 23922824
8.  Characterisation and Validation of Insertions and Deletions in 173 Patient Exomes 
PLoS ONE  2012;7(12):e51292.
Recent advances in genomics technologies have spurred unprecedented efforts in genome and exome re-sequencing aiming to unravel the genetic component of rare and complex disorders. While in rare disorders this allowed the identification of novel causal genes, the missing heritability paradox in complex diseases remains so far elusive. Despite rapid advances of next-generation sequencing, both the technology and the analysis of the data it produces are in their infancy. At present there is abundant knowledge pertaining to the role of rare single nucleotide variants (SNVs) in rare disorders and of common SNVs in common disorders. Although the 1000 Genomes Project has clearly highlighted the prevalence of rare variants and more complex variants (e.g. insertions, deletions), their role in disease is as yet far from elucidated.
We set out to analyse the properties of sequence variants identified in a comprehensive collection of exome re-sequencing studies performed on samples from patients affected by a broad range of complex and rare diseases (N = 173). Given the known potential for Loss of Function (LoF) variants to be false positive, we performed an extensive validation of the common, rare and private LoF variants identified, which indicated that most of the private and rare variants identified were indeed true, while common novel variants had a significantly higher false positive rate. Our results indicated a strong enrichment of very low-frequency insertion/deletion variants, so far under-investigated, which might be difficult to capture with low coverage and imputation approaches and for which most study designs would be under-powered. These insertions and deletions might play a significant role in disease genetics, contributing specifically to the underlying rare and private variation predicted to be discovered through next generation sequencing.
doi:10.1371/journal.pone.0051292
PMCID: PMC3522676  PMID: 23251486
9.  Developing and implementing an institute-wide data sharing policy 
Genome Medicine  2011;3(9):60.
The Wellcome Trust Sanger Institute has a strong reputation for prepublication data sharing as a result of its policy of rapid release of genome sequence data and particularly through its contribution to the Human Genome Project. The practicalities of broad data sharing remain largely uncharted, especially to cover the wide range of data types currently produced by genomic studies and to adequately address ethical issues. This paper describes the processes and challenges involved in implementing a data sharing policy on an institute-wide scale. This includes questions of governance, practical aspects of applying principles to diverse experimental contexts, building enabling systems and infrastructure, incentives and collaborative issues.
doi:10.1186/gm276
PMCID: PMC3239235  PMID: 21955348
10.  The GENCODE pseudogene resource 
Genome Biology  2012;13(9):R51.
Background
Pseudogenes have long been considered as nonfunctional genomic sequences. However, recent evidence suggests that many of them might have some form of biological activity, and the possibility of functionality has increased interest in their accurate annotation and integration with functional genomics data.
Results
As part of the GENCODE annotation of the human genome, we present the first genome-wide pseudogene assignment for protein-coding genes, based on both large-scale manual annotation and in silico pipelines. A key aspect of this coupled approach is that it allows us to identify pseudogenes in an unbiased fashion as well as untangle complex events through manual evaluation. We integrate the pseudogene annotations with the extensive ENCODE functional genomics information. In particular, we determine the expression level, transcription-factor and RNA polymerase II binding, and chromatin marks associated with each pseudogene. Based on their distribution, we develop simple statistical models for each type of activity, which we validate with large-scale RT-PCR-Seq experiments. Finally, we compare our pseudogenes with conservation and variation data from primate alignments and the 1000 Genomes project, producing lists of pseudogenes potentially under selection.
Conclusions
At one extreme, some pseudogenes possess conventional characteristics of functionality; these may represent genes that have recently died. On the other hand, we find interesting patterns of partial activity, which may suggest that dead genes are being resurrected as functioning non-coding RNAs. The activity data of each pseudogene are stored in an associated resource, psiDR, which will be useful for the initial identification of potentially functional pseudogenes.
doi:10.1186/gb-2012-13-9-r51
PMCID: PMC3491395  PMID: 22951037
11.  Evidence for Transcript Networks Composed of Chimeric RNAs in Human Cells 
PLoS ONE  2012;7(1):e28213.
The classic organization of a gene structure has followed the Jacob and Monod bacterial gene model proposed more than 50 years ago. Since then, empirical determinations of the complexity of the transcriptomes found in yeast to human have blurred the definition and physical boundaries of genes. Using multiple analysis approaches we have characterized individual gene boundaries mapping on human chromosomes 21 and 22. Analyses of the locations of the 5′ and 3′ transcriptional termini of 492 protein coding genes revealed that for 85% of these genes the boundaries extend beyond the current annotated termini, most often connecting with exons of transcripts from other well annotated genes. The biological and evolutionary importance of these chimeric transcripts is underscored by (1) the non-random interconnections of genes involved, (2) the greater phylogenetic depth of the genes involved in many chimeric interactions, (3) the coordination of the expression of connected genes and (4) the close in vivo and three dimensional proximity of the genomic regions being transcribed and contributing to parts of the chimeric RNAs. The non-random nature of the connection of the genes involved suggests that chimeric transcripts should not be studied in isolation, but together, as an RNA network.
doi:10.1371/journal.pone.0028213
PMCID: PMC3251577  PMID: 22238572
13.  The GENCODE exome: sequencing the complete human exome 
Sequencing the coding regions, the exome, of the human genome is one of the major current strategies to identify low frequency and rare variants associated with human disease traits. So far, the most widely used commercial exome capture reagents have mainly targeted the consensus coding sequence (CCDS) database. We report the design of an extended set of targets for capturing the complete human exome, based on annotation from the GENCODE consortium. The extended set covers an additional 5594 genes and 10.3 Mb compared with the current CCDS-based sets. The additional regions include potential disease genes previously inaccessible to exome resequencing studies, such as 43 genes linked to ion channel activity and 70 genes linked to protein kinase activity. In total, the new GENCODE exome set developed here covers 47.9 Mb and performed well in sequence capture experiments. In the sample set used in this study, we identified over 5000 SNP variants more in the GENCODE exome target (24%) than in the CCDS-based exome sequencing.
doi:10.1038/ejhg.2011.28
PMCID: PMC3137498  PMID: 21364695
human exome; resequencing; GENCODE
14.  Using Workflows to Explore and Optimise Named Entity Recognition for Chemistry 
PLoS ONE  2011;6(5):e20181.
Chemistry text mining tools should be interoperable and adaptable regardless of system-level implementation, installation or even programming issues. We aim to abstract the functionality of these tools from the underlying implementation via reconfigurable workflows for automatically identifying chemical names. To achieve this, we refactored an established named entity recogniser (in the chemistry domain), OSCAR, and studied the impact of each component on the net performance. We developed two reconfigurable workflows from OSCAR using an interoperable text mining framework, U-Compare. These workflows can be altered using the drag-&-drop mechanism of the graphical user interface of U-Compare. These workflows also provide a platform to study the relationship between text mining components such as tokenisation and named entity recognition (using maximum entropy Markov model (MEMM) and pattern recognition based classifiers). Results indicate that, for chemistry in particular, eliminating noise generated by tokenisation techniques leads to a slightly better performance than others, in terms of named entity recognition (NER) accuracy. Poor tokenisation translates into poorer input to the classifier components, which in turn leads to an increase in Type I or Type II errors, thus lowering the overall performance. On the Sciborg corpus, the workflow based system, which uses a new tokeniser whilst retaining the same MEMM component, increases the F-score from 82.35% to 84.44%. On the PubMed corpus, it recorded an F-score of 84.84% as against 84.23% by OSCAR.
doi:10.1371/journal.pone.0020181
PMCID: PMC3102085  PMID: 21633495
15.  The Origins, Evolution, and Functional Potential of Alternative Splicing in Vertebrates 
Molecular Biology and Evolution  2011;28(10):2949-2959.
Alternative splicing (AS) has the potential to greatly expand the functional repertoire of mammalian transcriptomes. However, few variant transcripts have been characterized functionally, making it difficult to assess the contribution of AS to the generation of phenotypic complexity and to study the evolution of splicing patterns. We have compared the AS of 309 protein-coding genes in the human ENCODE pilot regions against their mouse orthologs in unprecedented detail, utilizing traditional transcriptomic and RNAseq data. The conservation status of every transcript has been investigated, and each functionally categorized as coding (separated into coding sequence [CDS] or nonsense-mediated decay [NMD] linked) or noncoding. In total, 36.7% of human and 19.3% of mouse coding transcripts are species specific, and we observe a 3.6 times excess of human NMD transcripts compared with mouse; in contrast to previous studies, the majority of species-specific AS is unlinked to transposable elements. We observe one conserved CDS variant and one conserved NMD variant per 2.3 and 11.4 genes, respectively. Subsequently, we identify and characterize equivalent AS patterns for 22.9% of these CDS or NMD-linked events in nonmammalian vertebrate genomes, and our data indicate that functional NMD-linked AS is more widespread and ancient than previously thought. Furthermore, although we observe an association between conserved AS and elevated sequence conservation, as previously reported, we emphasize that 30% of conserved AS exons display sequence conservation below the average score for constitutive exons. In conclusion, we demonstrate the value of detailed comparative annotation in generating a comprehensive set of AS transcripts, increasing our understanding of AS evolution in vertebrates. Our data support a model whereby the acquisition of functional AS has occurred throughout vertebrate evolution and is considered alongside amino acid change as a key mechanism in gene evolution.
doi:10.1093/molbev/msr127
PMCID: PMC3176834  PMID: 21551269
alternative splicing; nonsense-mediated decay; vertebrate evolution; RBM39
16.  Dalliance: interactive genome viewing on the web 
Bioinformatics  2011;27(6):889-890.
Summary: Dalliance is a new genome viewer which offers a high level of interactivity while running within a web browser. All data is fetched using the established distributed annotation system (DAS) protocol, making it easy to customize the browser and add extra data.
Availability and Implementation: Dalliance runs entirely within your web browser, and relies on existing DAS server infrastructure. Browsers for several mammalian genomes are available at http://www.biodalliance.org/, and the use of DAS means you can add your own data to these browsers. In addition, the source code (Javascript) is available under the BSD license, and is straightforward to install on your own web server and embed within other documents.
Contact: thomas@biodalliance.org
doi:10.1093/bioinformatics/btr020
PMCID: PMC3051325  PMID: 21252075
17.  Predicting DNA-Binding Specificities of Eukaryotic Transcription Factors 
PLoS ONE  2010;5(11):e13876.
Today, annotated amino acid sequences of more and more transcription factors (TFs) are readily available. Quantitative information about their DNA-binding specificities, however, is hard to obtain. Position frequency matrices (PFMs), the most widely used models to represent binding specificities, are experimentally characterized only for a small fraction of all TFs. Even for some of the most intensively studied eukaryotic organisms (i.e., human, rat and mouse), roughly one-sixth of all proteins with an annotated DNA-binding domain have been characterized experimentally. Here, we present a new method based on support vector regression for predicting quantitative DNA-binding specificities of TFs in different eukaryotic species. This approach estimates a quantitative measure for the PFM similarity of two proteins, based on various features derived from their protein sequences. The method is trained and tested on a dataset containing 1 239 TFs with known DNA-binding specificity, and used to predict specific DNA target motifs for 645 TFs with high accuracy.
doi:10.1371/journal.pone.0013876
PMCID: PMC2994704  PMID: 21152420
18.  Expanding the Landscape of Chromatin Modification (CM)-Related Functional Domains and Genes in Human 
PLoS ONE  2010;5(11):e14122.
Chromatin modification (CM) plays a key role in regulating transcription, DNA replication, repair and recombination. However, our knowledge of these processes in humans remains very limited. Here we use computational approaches to study proteins and functional domains involved in CM in humans. We analyze the abundance and the pair-wise domain-domain co-occurrences of 25 well-documented CM domains in 5 model organisms: yeast, worm, fly, mouse and human. Results show that domains involved in histone methylation, DNA methylation, and histone variants are remarkably expanded in metazoans, reflecting the increased demand for cell type-specific gene regulation. We find that CM domains tend to co-occur with a limited number of partner domains and are hence not promiscuous. This property is exploited to identify 47 potentially novel CM domains, including 24 DNA-binding domains, whose role in CM has received little attention so far. Lastly, we use a consensus Machine Learning approach to predict 379 novel CM genes (coding for 329 proteins) in humans based on domain compositions. Several of these predictions are supported by very recent experimental studies and others are slated for experimental verification. Identification of novel CM genes and domains in humans will aid our understanding of fundamental epigenetic processes that are important for stem cell differentiation and cancer biology. Information on all the candidate CM domains and genes reported here is publicly available.
doi:10.1371/journal.pone.0014122
PMCID: PMC2993927  PMID: 21124763
20.  AnnoTrack - a tracking system for genome annotation 
BMC Genomics  2010;11:538.
Background
As genome sequences are determined for increasing numbers of model organisms, demand has grown for better tools to facilitate unified genome annotation efforts by communities of biologists. Typically this process involves numerous experts from the field and the use of data from dispersed sources as evidence. This kind of collaborative annotation project requires specialized software solutions for efficient data tracking and processing.
Results
As part of the scale-up phase of the ENCODE project (Encyclopedia of DNA Elements), the aim of the GENCODE project is to produce a highly accurate evidence-based reference gene annotation for the human genome. The AnnoTrack software system was developed to aid this effort. It integrates data from multiple distributed sources, highlights conflicts and facilitates the quick identification, prioritisation and resolution of problems during the process of genome annotation.
Conclusions
AnnoTrack has been in use for the last year and has proven a very valuable tool for large-scale genome annotation. Designed to interface with standard bioinformatics components, such as DAS servers and Ensembl databases, it is easy to set up and configure for different genome projects. The source code is available at http://annotrack.sanger.ac.uk.
doi:10.1186/1471-2164-11-538
PMCID: PMC3091687  PMID: 20923551
21.  Large-scale cross-species oncogenomics identifies candidate oncogenes and tumor suppressor genes 
Cancer research  2010;70(3):883-895.
While genomic alterations identified in human tumors using techniques such as comparative genomic hybridisation (CGH) may be recurrent, they frequently encompass large regions, in some cases containing hundreds of genes. Here we combine high-resolution CGH analysis of 598 human cancer cell lines with insertion sites isolated from 1,005 mouse tumors induced with the Murine Leukaemia Virus (MuLV). This cross-species oncogenomic analysis revealed candidate tumor suppressor genes and oncogenes recurrently mutated in both human and mouse tumors, making them strong candidate cancer genes. A significant number of these genes contained binding sites for the transcription factors Oct4 and Nanog, and mice carrying tumors with insertions in or near stem cell module genes, genes that are thought to participate in self-renewal, died significantly faster than mice without these insertions. The profile of MuLV insertions that we identified was compared to insertions isolated from 73 tumors induced using the Sleeping Beauty (SB) transposon system, revealing significant differences in the profile of recurrently mutated genes. Collectively this work provides a rich catalogue of candidate genes for follow-up functional analysis.
doi:10.1158/0008-5472.CAN-09-1737
PMCID: PMC2880710  PMID: 20103622
Cross-species analysis; insertional mutagenesis; bioinformatics; oncogenomics; comparative genomic hybridization
22.  Metamotifs - a generative model for building families of nucleotide position weight matrices 
BMC Bioinformatics  2010;11:348.
Background
Development of high-throughput methods for measuring DNA interactions of transcription factors together with computational advances in short motif inference algorithms is expanding our understanding of transcription factor binding site motifs. The consequential growth of sequence motif data sets makes it important to systematically group and categorise regulatory motifs. It has been shown that there are familial tendencies in DNA sequence motifs that are predictive of the family of factors that binds them. Further development of methods that detect and describe familial motif trends has the potential to help in measuring the similarity of novel computational motif predictions to previously known data and sensitively detecting regulatory motifs similar to previously known ones from novel sequence.
Results
We propose a probabilistic model for position weight matrix (PWM) sequence motif families. The model, which we call the 'metamotif', describes recurring familial patterns in a set of motifs. The metamotif framework models variation within a family of sequence motifs. It allows for simultaneous estimation of a series of independent metamotifs from input position weight matrix (PWM) motif data and does not assume that all input motif columns contribute to a familial pattern. We describe an algorithm for inferring metamotifs from weight matrix data. We then demonstrate the use of the model in two practical tasks: in the Bayesian NestedMICA model inference algorithm as a PWM prior to enhance motif inference sensitivity, and in a motif classification task where motifs are labelled according to their interacting DNA binding domain.
Conclusions
We show that metamotifs can be used as PWM priors in the NestedMICA motif inference algorithm to dramatically increase the sensitivity to infer motifs. Metamotifs were also successfully applied to a motif classification problem where sequence motif features were used to predict the family of protein DNA binding domains that would interact with them. The metamotif based classifier is shown to compare favourably to previous related methods. The metamotif has great potential for further use in machine learning tasks related especially to de novo computational sequence motif inference. The metamotif methods presented have been incorporated into the NestedMICA suite.
doi:10.1186/1471-2105-11-348
PMCID: PMC2906491  PMID: 20579334
23.  Genome-wide end-sequenced BAC resources for the NOD/MrkTac and NOD/ShiLtJ mouse genomes 
Genomics  2010;95(2):105-110.
Non-obese diabetic (NOD) mice spontaneously develop type 1 diabetes (T1D) due to the progressive loss of insulin-secreting β-cells by an autoimmune driven process. NOD mice represent a valuable tool for studying the genetics of T1D and for evaluating therapeutic interventions. Here we describe the development and characterization by end-sequencing of bacterial artificial chromosome (BAC) libraries derived from NOD/MrkTac (DIL NOD) and NOD/ShiLtJ (CHORI-29), two commonly used NOD substrains. The DIL NOD library is composed of 196,032 BACs and the CHORI-29 library is composed of 110,976 BACs. The average depth of genome coverage of the DIL NOD library, estimated from mapping the BAC end-sequences to the reference mouse genome sequence, was 7.1-fold across the autosomes and 6.6-fold across the X chromosome. Clones from this library have an average insert size of 150 kb and map to over 95.6% of the reference mouse genome assembly (NCBIm37), covering 98.8% of Ensembl mouse genes. By the same metric, the CHORI-29 library has an average depth over the autosomes of 5.0-fold and 2.8-fold coverage of the X chromosome, the reduced X chromosome coverage being due to the use of a male donor for this library. Clones from this library have an average insert size of 205 kb and map to 93.9% of the reference mouse genome assembly, covering 95.7% of Ensembl genes. We have identified and validated 191,841 single nucleotide polymorphisms (SNPs) for DIL NOD and 114,380 SNPs for CHORI-29. In total we generated 229,736,133 bp of sequence for the DIL NOD and 121,963,211 bp for the CHORI-29. These BAC libraries represent a powerful resource for functional studies, such as gene targeting in NOD embryonic stem (ES) cell lines, and for sequencing and mapping experiments.
doi:10.1016/j.ygeno.2009.10.004
PMCID: PMC2824108  PMID: 19909804
Bacterial artificial chromosome; NOD/MrkTac; NOD/ShiLtJ; Mouse genome; Non-obese diabetic (NOD); Type 1 diabetes; T1D; Insulin-dependent diabetes; IDD
24.  iMotifs: an integrated sequence motif visualization and analysis environment 
Bioinformatics  2010;26(6):843-844.
Motivation: Short sequence motifs are an important class of models in molecular biology, used most commonly for describing transcription factor binding site specificity patterns. High-throughput methods have recently been developed for detecting regulatory factor binding sites in vivo and in vitro, and consequently high-quality binding site motif data are becoming available for an increasing number of organisms and regulatory factors. Development of intuitive tools for the study of sequence motifs is therefore important.
iMotifs is a graphical motif analysis environment that allows visualization of annotated sequence motifs and scored motif hits in sequences. It also offers motif inference with the sensitive NestedMICA algorithm, as well as overrepresentation and pairwise motif matching capabilities. All of the analysis functionality is provided without the need to convert between file formats or learn different command line interfaces.
The application includes a bundled and graphically integrated version of the NestedMICA motif inference suite that has no outside dependencies. Problems associated with local deployment of software are therefore avoided.
Availability: iMotifs is licensed under the GNU Lesser General Public License v2.0 (LGPL 2.0). The software and its source are available at http://wiki.github.com/mz2/imotifs and can be run on Mac OS X Leopard (Intel/PowerPC). We also provide a cross-platform (Linux, OS X, Windows) LGPL 2.0 licensed library, libxms, for the Perl, Ruby, R and Objective-C programming languages for input and output of XMS-formatted annotated sequence motif set files.
Contact: matias.piipari@gmail.com; imotifs@googlegroups.com
doi:10.1093/bioinformatics/btq026
PMCID: PMC2832821  PMID: 20106815
25.  Discovery of Candidate Disease Genes in ENU–Induced Mouse Mutants by Large-Scale Sequencing, Including a Splice-Site Mutation in Nucleoredoxin 
PLoS Genetics  2009;5(12):e1000759.
An accurate and precisely annotated genome assembly is a fundamental requirement for functional genomic analysis. Here, the complete DNA sequence and gene annotation of mouse Chromosome 11 were used to test the efficacy of large-scale sequencing for mutation identification. We re-sequenced the 14,000 annotated exons and boundaries from over 900 genes in 41 recessive mutant mouse lines that were isolated in an N-ethyl-N-nitrosourea (ENU) mutation screen targeted to mouse Chromosome 11. Fifty-nine sequence variants were identified in 55 genes from 31 mutant lines. 39% of the lesions lie in coding sequences and create primarily missense mutations. The other 61% lie in noncoding regions, many of them in highly conserved sequences. A lesion in the perinatal lethal line l11Jus13 alters a consensus splice site of nucleoredoxin (Nxn), inserting 10 amino acids into the resulting protein. We conclude that point mutations can be accurately and sensitively recovered by large-scale sequencing, and that conserved noncoding regions should be included for disease mutation identification. Only seven of the candidate genes we report have been previously targeted by mutation in mice or rats, showing that despite ongoing efforts to functionally annotate genes in the mammalian genome, an enormous gap remains between phenotype and function. Our data show that the classical positional mapping approach of disease mutation identification can be extended to large target regions using high-throughput sequencing.
Author Summary
Here we show that tiny DNA lesions can be found in huge amounts of DNA sequence data, similar to finding a needle in a haystack. These lesions identify many new candidates for disease genes associated with birth defects, infertility, and growth. Further, our data suggest that we know very little about what mammalian genes do. Sequencing methods are becoming cheaper and faster. Therefore, our strategy, shown here for the first time, will become commonplace.
doi:10.1371/journal.pgen.1000759
PMCID: PMC2782131  PMID: 20011118
