PMCC PMCC

Search tips
Search criteria

Advanced
Results 1-14 (14)
 

Clipboard (0)
None

Select a Filter Below

Journals
Year of Publication
more »
Document Types
author:("tli, Mary Ann")
1.  Automatic categorization of diverse experimental information in the bioscience literature 
BMC Bioinformatics  2012;13:16.
Background
Curation of information from bioscience literature into biological knowledge databases is a crucial way of capturing experimental information in a computable form. During the biocuration process, a critical first step is to identify from all published literature the papers that contain results for a specific data type the curator is interested in annotating. This step normally requires curators to manually examine many papers to ascertain which few contain information of interest and thus, is usually time consuming. We developed an automatic method for identifying papers containing these curation data types among a large pool of published scientific papers based on the machine learning method Support Vector Machine (SVM). This classification system is completely automatic and can be readily applied to diverse experimental data types. It has been in use in production for automatic categorization of 10 different experimental datatypes in the biocuration process at WormBase for the past two years and it is in the process of being adopted in the biocuration process at FlyBase and the Saccharomyces Genome Database (SGD). We anticipate that this method can be readily adopted by various databases in the biocuration community and thereby greatly reducing time spent on an otherwise laborious and demanding task. We also developed a simple, readily automated procedure to utilize training papers of similar data types from different bodies of literature such as C. elegans and D. melanogaster to identify papers with any of these data types for a single database. This approach has great significance because for some data types, especially those of low occurrence, a single corpus often does not have enough training papers to achieve satisfactory performance.
Results
We successfully tested the method on ten data types from WormBase, fifteen data types from FlyBase and three data types from Mouse Genomics Informatics (MGI). It is being used in the curation work flow at WormBase for automatic association of newly published papers with ten data types including RNAi, antibody, phenotype, gene regulation, mutant allele sequence, gene expression, gene product interaction, overexpression phenotype, gene interaction, and gene structure correction.
Conclusions
Our methods are applicable to a variety of data types with training set containing several hundreds to a few thousand documents. It is completely automatic and, thus can be readily incorporated to different workflow at different literature-based databases. We believe that the work presented here can contribute greatly to the tremendous task of automating the important yet labor-intensive biocuration effort.
doi:10.1186/1471-2105-13-16
PMCID: PMC3305665  PMID: 22280404
2.  WormBase 
Worm  2012;1(1):15-21.
WormBase (www.wormbase.org) has been serving the scientific community for over 11 years as the central repository for genomic and genetic information for the soil nematode Caenorhabditis elegans. The resource has evolved from its beginnings as a database housing the genomic sequence and genetic and physical maps of a single species, and now represents the breadth and diversity of nematode research, currently serving genome sequence and annotation for around 20 nematodes. In this article, we focus on WormBase’s role of genome sequence annotation, describing how we annotate and integrate data from a growing collection of nematode species and strains. We also review our approaches to sequence curation, and discuss the impact on annotation quality of large functional genomics projects such as modENCODE.
doi:10.4161/worm.19574
PMCID: PMC3670165  PMID: 24058818
Caenorhabditis elegans; annotation; community resource; genome; model organism database; nematode; parasitic nematode; sequence curation
3.  The EMBL Nucleotide Sequence Database 
Nucleic Acids Research  2004;32(Database issue):D27-D30.
The EMBL Nucleotide Sequence Database (http://www.ebi.ac.uk/embl/), maintained at the European Bioinformatics Institute (EBI), incorporates, organizes and distributes nucleotide sequences from public sources. The database is a part of an international collaboration with DDBJ (Japan) and GenBank (USA). Data are exchanged between the collaborating databases on a daily basis to achieve optimal synchrony. The web-based tool, Webin, is the preferred system for individual submission of nucleotide sequences, including Third Party Annotation (TPA) and alignment data. Automatic submission procedures are used for submission of data from large-scale genome sequencing centres and from the European Patent Office. Database releases are produced quarterly. The latest data collection can be accessed via FTP, email and WWW interfaces. The EBI’s Sequence Retrieval System (SRS) integrates and links the main nucleotide and protein databases as well as many other specialist molecular biology databases. For sequence similarity searching, a variety of tools (e.g. FASTA and BLAST) are available that allow external users to compare their own sequences against the data in the EMBL Nucleotide Sequence Database, the complete genomic component subsection of the database, the WGS data sets and other databases. All available resources can be accessed via the EBI home page at http://www.ebi.ac.uk.
doi:10.1093/nar/gkh120
PMCID: PMC308854  PMID: 14681351
4.  The EMBL Nucleotide Sequence Database: major new developments 
Nucleic Acids Research  2003;31(1):17-22.
The EMBL Nucleotide Sequence Database (http://www.ebi.ac.uk/embl/) incorporates, organizes and distributes nucleotide sequences from all available public sources. The database is located and maintained at the European Bioinformatics Institute (EBI) near Cambridge, UK. In an international collaboration with DDBJ (Japan) and GenBank (USA), data are exchanged amongst the collaborating databases on a daily basis to achieve optimal synchronization. Webin is the preferred web-based submission system for individual submitters, while automatic procedures allow incorporation of sequence data from large-scale genome sequencing centres and from the European Patent Office (EPO). Database releases are produced quarterly. Network services allow free access to the most up-to-date data collection via FTP, Email and World Wide Web interfaces. EBI's Sequence Retrieval System (SRS) integrates and links the main nucleotide and protein databases plus many other specialized molecular biology databases. For sequence similarity searching, a variety of tools (e.g. Fasta, BLAST) are available which allow external users to compare their own sequences against the latest data in the EMBL Nucleotide Sequence Database and SWISS-PROT. All resources can be accessed via the EBI home page at http://www.ebi.ac.uk.
PMCID: PMC165468  PMID: 12519939
5.  The European Bioinformatics Institute's data resources 
Nucleic Acids Research  2003;31(1):43-50.
As the amount of biological data grows, so does the need for biologists to store and access this information in central repositories in a free and unambiguous manner. The European Bioinformatics Institute (EBI) hosts six core databases, which store information on DNA sequences (EMBL-Bank), protein sequences (SWISS-PROT and TrEMBL), protein structure (MSD), whole genomes (Ensembl) and gene expression (ArrayExpress). But just as a cell would be useless if it couldn't transcribe DNA or translate RNA, our resources would be compromised if each existed in isolation. We have therefore developed a range of tools that not only facilitate the deposition and retrieval of biological information, but also allow users to carry out searches that reflect the interconnectedness of biological information. The EBI's databases and tools are all available on our website at www.ebi.ac.uk.
PMCID: PMC165513  PMID: 12519944
6.  The EMBL Nucleotide Sequence Database 
Nucleic Acids Research  2002;30(1):21-26.
The EMBL Nucleotide Sequence Database (aka EMBL-Bank; http://www.ebi.ac.uk/embl/) incorporates, organises and distributes nucleotide sequences from all available public sources. EMBL-Bank is located and maintained at the European Bioinformatics Institute (EBI) near Cambridge, UK. In an international collaboration with DDBJ (Japan) and GenBank (USA), data are exchanged amongst the collaborating databases on a daily basis. Major contributors to the EMBL database are individual scientists and genome project groups. Webin is the preferred web-based submission system for individual submitters, whilst automatic procedures allow incorporation of sequence data from large-scale genome sequencing centres and from the European Patent Office (EPO). Database releases are produced quarterly. Network services allow free access to the most up-to-date data collection via FTP, email and World Wide Web interfaces. EBI’s Sequence Retrieval System (SRS), a network browser for databanks in molecular biology, integrates and links the main nucleotide and protein databases plus many other specialized databases. For sequence similarity searching, a variety of tools (e.g. Blitz, Fasta, BLAST) are available which allow external users to compare their own sequences against the latest data in the EMBL Nucleotide Sequence Database and SWISS-PROT. All resources can be accessed via the EBI home page at http://www.ebi.ac.uk.
PMCID: PMC99098  PMID: 11752244
7.  The EMBL nucleotide sequence database 
Nucleic Acids Research  2001;29(1):17-21.
The EMBL Nucleotide Sequence Database (http://www.ebi.ac.uk/embl/) is maintained at the European Bioinformatics Institute (EBI) in an international collaboration with the DNA Data Bank of Japan (DDBJ) and GenBank at the NCBI (USA). Data is exchanged amongst the collaborating databases on a daily basis. The major contributors to the EMBL database are individual authors and genome project groups. Webin is the preferred web-based submission system for individual submitters, whilst automatic procedures allow incorporation of sequence data from large-scale genome sequencing centres and from the European Patent Office (EPO). Database releases are produced quarterly. Network services allow free access to the most up-to-date data collection via ftp, email and World Wide Web interfaces. EBI’s Sequence Retrieval System (SRS), a network browser for databanks in molecular biology, integrates and links the main nucleotide and protein databases plus many specialized databases. For sequence similarity searching a variety of tools (e.g. Blitz, Fasta, BLAST) are available which allow external users to compare their own sequences against the latest data in the EMBL Nucleotide Sequence Database and SWISS-PROT.
PMCID: PMC29766  PMID: 11125039
8.  The EMBL Nucleotide Sequence Database 
Nucleic Acids Research  2000;28(1):19-23.
The European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Database (http://www.ebi.ac. uk/embl/index.html ) is maintained at the European Bioinformatics Institute (EBI) in an international collaboration with the DNA Data Bank of Japan (DDBJ) and GenBank (USA). Data is exchanged amongst the collaborative databases on a daily basis. The major contributors to the EMBL database are individual authors and genome project groups. WEBIN is the preferred web-based submission system for individual submitters, whilst automatic procedures allow incorporation of sequence data from large-scale genome sequencing centres and from the European Patent Office (EPO). Database releases are produced quarterly. Network services allow free access to the most up-to-date data collection via Internet and WWW interfaces. EBI’s Sequence Retrieval System (SRS) is a network browser for databanks in molecular biology, integrating and linking the main nucleotide and protein databases plus many specialised databases. For sequence similarity searching a variety of tools (e.g., BLITZ, FASTA, BLAST) are available which allow external users to compare their own sequences against the most currently available data in the EMBL Nucleotide Sequence Database and SWISS-PROT.
PMCID: PMC102461  PMID: 10592171
9.  WormBase 2014: new views of curated biology 
Nucleic Acids Research  2013;42(D1):D789-D793.
WormBase (http://www.wormbase.org/) is a highly curated resource dedicated to supporting research using the model organism Caenorhabditis elegans. With an electronic history predating the World Wide Web, WormBase contains information ranging from the sequence and phenotype of individual alleles to genome-wide studies generated using next-generation sequencing technologies. In recent years, we have expanded the contents to include data on additional nematodes of agricultural and medical significance, bringing the knowledge of C. elegans to bear on these systems and providing support for underserved research communities. Manual curation of the primary literature remains a central focus of the WormBase project, providing users with reliable, up-to-date and highly cross-linked information. In this update, we describe efforts to organize the original atomized and highly contextualized curated data into integrated syntheses of discrete biological topics. Next, we discuss our experiences coping with the vast increase in available genome sequences made possible through next-generation sequencing platforms. Finally, we describe some of the features and tools of the new WormBase Web site that help users better find and explore data of interest.
doi:10.1093/nar/gkt1063
PMCID: PMC3965043  PMID: 24194605
10.  Ensembl Genomes 2013: scaling up access to genome-wide data 
Nucleic Acids Research  2013;42(D1):D546-D552.
Ensembl Genomes (http://www.ensemblgenomes.org) is an integrating resource for genome-scale data from non-vertebrate species. The project exploits and extends technologies for genome annotation, analysis and dissemination, developed in the context of the vertebrate-focused Ensembl project, and provides a complementary set of resources for non-vertebrate species through a consistent set of programmatic and interactive interfaces. These provide access to data including reference sequence, gene models, transcriptional data, polymorphisms and comparative analysis. This article provides an update to the previous publications about the resource, with a focus on recent developments. These include the addition of important new genomes (and related data sets) including crop plants, vectors of human disease and eukaryotic pathogens. In addition, the resource has scaled up its representation of bacterial genomes, and now includes the genomes of over 9000 bacteria. Specific extensions to the web and programmatic interfaces have been developed to support users in navigating these large data sets. Looking forward, analytic tools to allow targeted selection of data for visualization and download are likely to become increasingly important in future as the number of available genomes increases within all domains of life, and some of the challenges faced in representing bacterial data are likely to become commonplace for eukaryotes in future.
doi:10.1093/nar/gkt979
PMCID: PMC3965094  PMID: 24163254
11.  WormBase: a comprehensive resource for nematode research 
Nucleic Acids Research  2009;38(Database issue):D463-D467.
WormBase (http://www.wormbase.org) is a central data repository for nematode biology. Initially created as a service to the Caenorhabditis elegans research field, WormBase has evolved into a powerful research tool in its own right. In the past 2 years, we expanded WormBase to include the complete genomic sequence, gene predictions and orthology assignments from a range of related nematodes. This comparative data enrich the C. elegans data with improved gene predictions and a better understanding of gene function. In turn, they bring the wealth of experimental knowledge of C. elegans to other systems of medical and agricultural importance. Here, we describe new species and data types now available at WormBase. In addition, we detail enhancements to our curatorial pipeline and website infrastructure to accommodate new genomes and an extensive user base.
doi:10.1093/nar/gkp952
PMCID: PMC2808986  PMID: 19910365
12.  WormBase 2007 
Nucleic Acids Research  2007;36(Database issue):D612-D617.
WormBase (www.wormbase.org) is the major publicly available database of information about Caenorhabditis elegans, an important system for basic biological and biomedical research. Derived from the initial ACeDB database of C. elegans genetic and sequence information, WormBase now includes the genomic, anatomical and functional information about C. elegans, other Caenorhabditis species and other nematodes. As such, it is a crucial resource not only for C. elegans biologists but the larger biomedical and bioinformatics communities. Coverage of core areas of C. elegans biology will allow the biomedical community to make full use of the results of intensive molecular genetic analysis and functional genomic studies of this organism. Improved search and display tools, wider cross-species comparisons and extended ontologies are some of the features that will help scientists extend their research and take advantage of other nematode species genome sequences.
doi:10.1093/nar/gkm975
PMCID: PMC2238927  PMID: 17991679
13.  WormBase: new content and better access 
Nucleic Acids Research  2006;35(Database issue):D506-D510.
WormBase (), a model organism database for Caenorhabditis elegans and other related nematodes, continues to evolve and expand. Over the past year WormBase has added new data on C.elegans, including data on classical genetics, cell biology and functional genomics; expanded the annotation of closely related nematodes with a new genome browser for Caenorhabditis remanei; and deployed new hardware for stronger performance. Several existing datasets including phenotype descriptions and RNAi experiments have seen a large increase in new content. New datasets such as the C.remanei draft assembly and annotations, the Vancouver Fosmid library and TEC-RED 5′ end sites are now available as well. Access to and searching WormBase has become more dependable and flexible via multiple mirror sites and indexing through Google.
doi:10.1093/nar/gkl818
PMCID: PMC1669750  PMID: 17099234
14.  WormBase: better software, richer content 
Nucleic Acids Research  2005;34(Database issue):D475-D478.
WormBase (), the public database for genomics and biology of Caenorhabditis elegans, has been restructured for stronger performance and expanded for richer biological content. Performance was improved by accelerating the loading of central data pages such as the omnibus Gene page, by rationalizing internal data structures and software for greater portability, and by making the Genome Browser highly customizable in how it views and exports genomic subsequences. Arbitrarily complex, user-specified queries are now possible through Textpresso (for all available literature) and through WormMart (for most genomic data). Biological content was enriched by reconciling all available cDNA and expressed sequence tag data with gene predictions, clarifying single nucleotide polymorphism and RNAi sites, and summarizing known functions for most genes studied in this organism.
doi:10.1093/nar/gkj061
PMCID: PMC1347424  PMID: 16381915

Results 1-14 (14)