Biodiversity data are being digitized and made available online at a rapidly increasing rate, but current practices typically do not preserve linkages between these data, which impedes interoperation, provenance tracking, and assembly of larger datasets. For data associated with biocollections, the biodiversity community has long recognized that an essential part of establishing and preserving linkages is to apply globally unique identifiers at the point when data are generated in the field and to persist these identifiers downstream, but this is seldom implemented in practice. There has neither been coalescence towards one single identifier solution (as in some other domains), nor even a set of recommended best practices and standards to support multiple identifier schemes sharing consistent responses. In order to make further progress towards a broader community consensus, a group of biocollections and informatics experts assembled in Stockholm in October 2014 to discuss community next steps to overcome current roadblocks. The workshop participants divided into four groups focusing on: identifier practice in current field biocollections; identifier application for legacy biocollections; identifiers as applied to biodiversity data records as they are published and made available in semantically marked-up publications; and cross-cutting identifier solutions that bridge across these domains. The main outcome was consensus on key issues, including recognition of differences between legacy and new biocollections processes, the need for identifier metadata profiles that can report information on identifier persistence missions, and the unambiguous indication of the type of object associated with the identifier. Current identifier characteristics are also summarized, and an overview of available schemes and practices is provided.
Biocollections; identifiers; Globally Unique Identifiers; GUIDs; field collections; legacy collections; linked open data; semantic publishing
One of the grand goals of historical biogeography is to understand how and why species' population sizes and distributions change over time. Multiple types of data drawn from disparate fields, combined into a single modelling framework, are necessary to document changes in a species's demography and distribution, and to determine the drivers responsible for change. Yet truly integrated approaches are challenging and rarely performed. Here, we discuss a modelling framework that integrates spatio-temporal fossil data, ancient DNA, palaeoclimatological reconstructions, bioclimatic envelope modelling and coalescence models in order to statistically test alternative hypotheses of demographic and potential distributional changes for the iconic American bison (Bison bison). Using different assumptions about the evolution of the bioclimatic niche, we generate hypothetical distributional and demographic histories of the species. We then test these demographic models by comparing the genetic signature predicted by serial coalescence against sequence data derived from subfossils and modern populations. Our results supported demographic models that include both climate and human-associated drivers of population declines. This synthetic approach, integrating palaeoclimatology, bioclimatic envelopes, serial coalescence, spatio-temporal fossil data and heterochronous DNA sequences, improves understanding of species' historical biogeography by allowing consideration of both abiotic and biotic interactions at the population level.
ancient DNA; bison; bioclimatic envelope models; Late Quaternary; historical biogeography; palaeoclimatic reconstructions
This report describes the outcomes of a recent workshop, building on a series of workshops from the last three years with the goal of integrating genomics and biodiversity research, with a more specific goal here to express terms in Darwin Core and Audubon Core, where class constructs have been historically underspecified, into a Biological Collections Ontology (BCO) framework. For the purposes of this workshop, the BCO provided the context for fully defining classes as well as object and data properties, including domain and range information, for both the Darwin Core and Audubon Core. In addition, the workshop participants reviewed technical specifications and approaches for annotating instance data with BCO terms. Finally, we laid out proposed activities for the next 3 to 18 months to continue this work.
Ontology; Biodiversity; Population; Community; Darwin Core; OWL; RDF; Microbial ecology; Sequencing
The biodiversity informatics community has discussed aspirations and approaches for assigning globally unique identifiers (GUIDs) to biocollections for nearly a decade. During that time, and despite misgivings, the de facto standard identifier has become the “Darwin Core Triplet”, which is a concatenation of values for institution code, collection code, and catalog number associated with biocollections material. Our aim is not to rehash the challenging discussions regarding which GUID system in theory best supports the biodiversity informatics use case of discovering and linking digital data across the Internet, but how well we can link those data together at this moment, utilizing the current identifier schemes that have already been deployed. We gathered Darwin Core Triplets from a subset of VertNet records, along with vertebrate records from GenBank and the Barcode of Life Data System, in order to determine how Darwin Core Triplets are deployed “in the wild”. We asked if those triplets follow the recommended structure and whether they provide an easy and unambiguous means to track from specimen records to genetic sequence records. We show that Darwin Core Triplets are often riddled with semantic and syntactic errors when deployed and curated in practice, despite specifications about how to construct them. Our results strongly suggest that Darwin Core Triplets that have not been carefully curated are not currently serving a useful role for relinking data. We briefly consider needed next steps to overcome current limitations.
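The recommended triplet structure lends itself to mechanical checking. The following is a minimal sketch of our own, not part of any standard tooling, assuming the conventional colon-delimited form with exactly three non-empty, whitespace-free components; the example values are hypothetical.

```python
import re

# Darwin Core Triplet: institution code, collection code, and catalog
# number joined by colons, e.g. "MVZ:Mamm:165120" (hypothetical value).
TRIPLET_RE = re.compile(r"^([^\s:]+):([^\s:]+):([^\s:]+)$")

def parse_triplet(value):
    """Return (institution, collection, catalog_number), or None if the
    string does not follow the three-part colon-delimited convention."""
    match = TRIPLET_RE.match(value)
    return match.groups() if match else None

print(parse_triplet("MVZ:Mamm:165120"))  # well-formed triplet
print(parse_triplet("MVZ::165120"))      # empty component -> rejected
```

A validator along these lines makes the kinds of syntactic errors described above (wrong delimiters, missing components, embedded whitespace) detectable before records are published.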
The planet is experiencing an ongoing global biodiversity crisis. Measuring the magnitude and rate of change more effectively requires access to organized, easily discoverable, and digitally-formatted biodiversity data, both legacy and new, from across the globe. Assembling this coherent digital representation of biodiversity requires the integration of data that have historically been analog, dispersed, and heterogeneous. The Integrated Publishing Toolkit (IPT) is a software package developed to support biodiversity dataset publication in a common format. The IPT’s two primary functions are to 1) encode existing species occurrence datasets and checklists, such as records from natural history collections or observations, in the Darwin Core standard to enhance interoperability of data, and 2) publish and archive data and metadata for broad use in a Darwin Core Archive, a set of files following a standard format. Here we discuss the key need for the IPT, how it has developed in response to community input, and how it continues to evolve to streamline and enhance the interoperability, discoverability, and mobilization of new data types beyond basic Darwin Core records. We close with a discussion of how the IPT has impacted the biodiversity research community and how it enhances data publishing in more traditional journal venues, along with new features implemented in the latest version of the IPT and future plans for further enhancements.
Recent years have brought great progress in efforts to digitize the world’s biodiversity data, but integrating data from many different providers, and across research domains, remains challenging. Semantic Web technologies have been widely recognized by biodiversity scientists for their potential to help solve this problem, yet these technologies have so far seen little use for biodiversity data. Such slow uptake has been due, in part, to the relative complexity of Semantic Web technologies along with a lack of domain-specific software tools to help non-experts publish their data to the Semantic Web.
The BiSciCol Triplifier is new software that greatly simplifies the process of converting biodiversity data in standard, tabular formats, such as Darwin Core-Archives, into Semantic Web-ready Resource Description Framework (RDF) representations. The Triplifier uses a vocabulary based on the popular Darwin Core standard, includes both Web-based and command-line interfaces, and is fully open-source software.
Unlike most other RDF conversion tools, the Triplifier does not require detailed familiarity with core Semantic Web technologies, and it is tailored to a widely popular biodiversity data format and vocabulary standard. As a result, the Triplifier can often fully automate the conversion of biodiversity data to RDF, thereby making the Semantic Web much more accessible to biodiversity scientists who might otherwise have relatively little knowledge of Semantic Web technologies. Easy availability of biodiversity data as RDF will allow researchers to combine data from disparate sources and analyze them with powerful linked data querying tools. However, before software like the Triplifier, and Semantic Web technologies in general, can reach their full potential for biodiversity science, the biodiversity informatics community must address several critical challenges, such as the widespread failure to use robust, globally unique identifiers for biodiversity data.
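The conversion the Triplifier performs can be illustrated in miniature. The sketch below is not the Triplifier's actual implementation; it only shows the underlying idea of mapping one tabular Darwin Core record to RDF statements, serialized here as N-Triples. The subject URI and field values are invented; the term namespace is the real Darwin Core namespace.

```python
# Darwin Core term namespace (real); subject URI below is hypothetical.
DWC = "http://rs.tdwg.org/dwc/terms/"

def record_to_ntriples(subject_uri, record):
    """Emit one N-Triples statement per populated Darwin Core field."""
    lines = []
    for term, value in record.items():
        if value:  # skip empty cells
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            lines.append(f'<{subject_uri}> <{DWC}{term}> "{escaped}" .')
    return "\n".join(lines)

row = {"scientificName": "Bison bison", "country": "United States", "locality": ""}
print(record_to_ntriples("http://example.org/occurrence/1", row))
```

Each row of a Darwin Core Archive becomes a set of subject–predicate–object statements sharing one subject, which is what makes the converted data queryable with linked-data tools such as SPARQL.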
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2105-15-257) contains supplementary material, which is available to authorized users.
Biocollections; Biodiversity informatics; Darwin Core; Linked data; Ontology; RDF; Semantic web; SPARQL
Scientific names of biological entities offer an imperfect resolution of the concepts that they are intended to represent. Often they are labels applied to entities ranging from entire populations to individual specimens representing those populations, even though such names only unambiguously identify the type specimen to which they were originally attached. Thus the real-life referents of names are constantly changing as biological circumscriptions are redefined and thereby alter the sets of individuals bearing those names. This problem is compounded by other characteristics of names that make them ambiguous identifiers of biological concepts, including emendations, homonymy and synonymy. Taxonomic concepts have been proposed as a way to address issues related to scientific names, but they have yet to receive broad recognition or implementation. Some efforts have been made towards building systems that address these issues by cataloguing and organizing taxonomic concepts, but most are still in conceptual or proof-of-concept stage. We present the on-line database Avibase as one possible approach to organizing taxonomic concepts. Avibase has been successfully used to describe and organize 844,000 species-level and 705,000 subspecies-level taxonomic concepts across every major bird taxonomic checklist of the last 125 years. The use of taxonomic concepts in place of scientific names, coupled with efficient resolution services, is a major step toward addressing some of the main deficiencies in the current practices of scientific name dissemination and use.
Biodiversity informatics; scientific names; taxon circumscription; taxonomic concepts; taxonomic database
The study of biodiversity spans many disciplines and includes data pertaining to species distributions and abundances, genetic sequences, trait measurements, and ecological niches, complemented by information on collection and measurement protocols. A review of the current landscape of metadata standards and ontologies in biodiversity science suggests that existing standards such as the Darwin Core terminology are inadequate for describing biodiversity data in a semantically meaningful and computationally useful way. Existing ontologies, such as the Gene Ontology and others in the Open Biological and Biomedical Ontologies (OBO) Foundry library, provide a semantic structure but lack many of the necessary terms to describe biodiversity data in all its dimensions. In this paper, we describe the motivation for and ongoing development of a new Biological Collections Ontology, the Environment Ontology, and the Population and Community Ontology. These ontologies share the aim of improving data aggregation and integration across the biodiversity domain and can be used to describe physical samples and sampling processes (for example, collection, extraction, and preservation techniques), as well as biodiversity observations that involve no physical sampling. Together they encompass studies of: 1) individual organisms, including voucher specimens from ecological studies and museum specimens, 2) bulk or environmental samples (e.g., gut contents, soil, water) that include DNA, other molecules, and potentially many organisms, especially microbes, and 3) survey-based ecological observations. We discuss how these ontologies can be applied to biodiversity use cases that span genetic, organismal, and ecosystem levels of organization. We argue that if adopted as a standard and rigorously applied and enriched by the biodiversity community, these ontologies would significantly reduce barriers to data discovery, integration, and exchange among biodiversity resources and researchers.
We describe the outcomes of three recent workshops aimed at advancing development of the Biological Collections Ontology (BCO), the Population and Community Ontology (PCO), and tools to annotate data using those and other ontologies. The first workshop gathered use cases to help grow the PCO, agreed upon a format for modeling challenging concepts such as ecological niche, and developed ontology design patterns for defining collections of organisms and population-level phenotypes. The second focused on mapping datasets to ontology terms and converting them to Resource Description Framework (RDF), using the BCO. To follow up, a BCO hackathon was held concurrently with the 16th Genomics Standards Consortium Meeting, during which we converted additional datasets to RDF, developed a Material Sample Core for the Global Biodiversity Information Facility, created a Web Ontology Language (OWL) file for importing Darwin Core classes and properties into BCO, and developed a workflow for converting biodiversity data among formats.
Ontology; Biodiversity; Population; Community; Darwin Core; OWL; RDF; Microbial ecology; Sequencing
Determining the magnitude of climate change patterns across elevational gradients is essential for an improved understanding of broader climate change patterns and for predicting hydrologic and ecosystem changes. We present temperature trends from five long-term weather stations along a 2077-meter elevational transect in the Rocky Mountain Front Range of Colorado, USA. These trends were measured over two time periods: a full 56-year record (1953–2008) and a shorter 20-year (1989–2008) record representing a period of widely reported accelerating change. The rates of change of two biological indicators, season length and accumulated growing-degree days, were also measured over the 56- and 20-year records. Finally, we compared how well interpolated Parameter-elevation Regression on Independent Slopes Model (PRISM) datasets match the quality-controlled weather data from each station. Our results show that warming signals were strongest at mid-elevations over both temporal scales. Over the 56-year record, most sites show warming occurring largely through increases in maximum temperatures, while the 20-year record documents warming associated with increases in maximum temperatures at lower elevations and increases in minimum temperatures at higher elevations. Recent decades have also shown a shift from warming during springtime to warming in July and November. Warming along the gradient has contributed to increases in growing-degree days, although to differing degrees, over both temporal scales. However, the length of the growing season has remained unchanged. Finally, the actual and the PRISM-interpolated yearly rates rarely showed strong correlations and suggested different warming and cooling trends at most sites. Interpretation of climate trends and their seasonal biases in the Rocky Mountain Front Range is dependent on both elevation and the temporal scale of analysis.
Given mismatches between interpolated data and the directly measured station data, we caution against an over-reliance on interpolation methods for documenting local patterns of climatic change.
Anthropogenic effects on wildlife are typically assessed at the local level, but it is often difficult to extrapolate to larger spatial extents. Macro-level occupancy studies are one way to assess impacts of multiple disturbance factors that might vary over different geographic extents. Here we assess anthropogenic effects on occupancy and distribution for several mammal species within the Appalachian Trail (AT), a forest corridor that extends across a broad section of the eastern United States. Utilizing camera traps and a large volunteer network of citizen scientists, we were able to sample 447 sites along a 1024 km section of the AT to assess the effects of available habitat, hunting, recreation, and roads on eight mammal species. Occupancy modeling revealed the importance of available forest to all species except opossums (Didelphis virginiana) and coyotes (Canis latrans). Hunting on adjoining lands was the second strongest predictor of occupancy for three mammal species, negatively influencing black bears (Ursus americanus) and bobcats (Lynx rufus), while positively influencing raccoons (Procyon lotor). Modeling also indicated an avoidance of high trail use areas by bears and proclivity towards high use areas by red fox (Vulpes vulpes). Roads had the lowest predictive power on species occupancy within the corridor and were only significant for deer. The occupancy models stress the importance of compounding direct and indirect anthropogenic influences operating at the regional level. Scientists and managers should consider these human impacts and their potential combined influence on wildlife persistence when assessing optimal habitat or considering management actions.
Legacy data from natural history collections contain invaluable and irreplaceable information about biodiversity in the recent past, providing a baseline for detecting change and forecasting the future of biodiversity on a human-dominated planet. However, these data are often not available in formats that facilitate use and synthesis. New approaches are needed to enhance the rates of digitization and data quality improvement. Notes from Nature provides one such novel approach by asking citizen scientists to help with transcription tasks. The initial web-based prototype of Notes from Nature will soon be widely available and was developed collaboratively by biodiversity scientists, natural history collections staff, and experts in citizen science project development, programming and visualization. This project brings together digital images representing different types of biodiversity records including ledgers, herbarium sheets and pinned insects from multiple projects and natural history collections. Experts in developing web-based citizen science applications then designed and built a platform for transcribing textual data and metadata from these images. The end product is a fully open source web transcription tool built using the latest web technologies. The platform keeps volunteers engaged by initially explaining the scientific importance of the work via a short orientation, and then providing transcription “missions” of well-defined scope, along with dynamic feedback, interactivity and rewards. Transcribed records, along with record-level and process metadata, are provided back to the institutions. While the tool is being developed with new users in mind, it can serve a broad range of needs from novice to trained museum specialist. Notes from Nature has the potential to speed the rate of biodiversity data being made available to a broad community of users.
Natural History Museums; Biodiversity; Open Source; Museum Collections; Citizen Science; Digitization; Transcription
Part diary, part scientific record, biological field notebooks often contain details necessary to understanding the location and environmental conditions existent during collecting events. Despite their clear value for (and recent use in) global change studies, the text-mining outputs from field notebooks have been idiosyncratic to specific research projects, and impossible to discover or re-use. Best practices and workflows for digitization, transcription, extraction, and integration with other sources are nascent or non-existent. In this paper, we demonstrate a workflow to generate structured outputs while also maintaining links to the original texts. The first step in this workflow was to place already digitized and transcribed field notebooks from the University of Colorado Museum of Natural History founder, Junius Henderson, on Wikisource, an open text transcription platform. Next, we created Wikisource templates to document places, dates, and taxa to facilitate annotation and wiki-linking. We then requested help from the public, through social media tools, to take advantage of volunteer efforts and energy. After three notebooks were fully annotated, content was converted into XML and annotations were extracted and cross-walked into Darwin Core-compliant record sets. Finally, these record sets were vetted, to provide valid taxon names, via a process we call “taxonomic referencing.” The result is identification and mobilization of 1,068 observations from three of Henderson’s thirteen notebooks and a publishable Darwin Core record set for use in other analyses. Although challenges remain, this work demonstrates a feasible approach to unlock observations from field notebooks that enhances their discovery and interoperability without losing the narrative context from which those observations are drawn.
“Compose your notes as if you were writing a letter to someone a century in the future.”
Perrine and Patton (2011)
Field notes; notebooks; crowd sourcing; digitization; biodiversity; transcription; text-mining; Darwin Core; Junius Henderson; annotation; taxonomic referencing; natural history; Wikisource; Colorado; species occurrence records
Here we present a standard developed by the Genomic Standards Consortium (GSC) for reporting marker gene sequences—the minimum information about a marker gene sequence (MIMARKS). We also introduce a system for describing the environment from which a biological sample originates. The ‘environmental packages’ apply to any genome sequence of known origin and can be used in combination with MIMARKS and other GSC checklists. Finally, to establish a unified standard for describing sequence data and to provide a single point of entry for the scientific community to access and learn about GSC checklists, we present the minimum information about any (x) sequence (MIxS). Adoption of MIxS will enhance our ability to analyze natural genetic diversity documented by massive DNA sequencing efforts from myriad ecosystems in our ever-changing biosphere.
Biodiversity data derive from myriad sources stored in various formats on many distinct hardware and software platforms. An essential step towards understanding global patterns of biodiversity is to provide a standardized view of these heterogeneous data sources to improve interoperability. Fundamental to this advance are definitions of common terms. This paper describes the evolution and development of Darwin Core, a data standard for publishing and integrating biodiversity information. We focus on the categories of terms that define the standard, differences between simple and relational Darwin Core, how the standard has been implemented, and the community processes that are essential for maintenance and growth of the standard. We present case-study extensions of the Darwin Core into new research communities, including metagenomics and genetic resources. We close by showing how Darwin Core records are integrated to create new knowledge products documenting species distributions and changes due to environmental perturbations.
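The distinction between simple and relational Darwin Core can be made concrete: simple Darwin Core flattens each occurrence into one row whose column headers are Darwin Core term names. Below is a minimal sketch with invented record values (the term names are genuine Darwin Core terms), writing records to CSV, the usual serialization inside a Darwin Core Archive.

```python
import csv
import io

# Simple Darwin Core: one flat row per occurrence, columns named by
# Darwin Core terms. All field values here are invented examples.
FIELDS = ["occurrenceID", "scientificName", "eventDate",
          "decimalLatitude", "decimalLongitude"]

records = [
    {"occurrenceID": "urn:example:occ:1", "scientificName": "Bison bison",
     "eventDate": "1959-07-14", "decimalLatitude": "40.01",
     "decimalLongitude": "-105.33"},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(records)
print(buffer.getvalue())
```

Relational Darwin Core, by contrast, links such a core table to extension tables (e.g., identifications or measurements) keyed on the core record's identifier, which is what a Darwin Core Archive's metafile describes.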
The impacts of climate change on phenological responses of species and communities are well-documented; however, many such studies are correlational and so less effective at assessing the causal links between changes in climate and changes in phenology. Using grasshopper communities found along an elevational gradient, we present an ideal system along the Front Range of Colorado, USA that provides a mechanistic link between climate and phenology.
This study utilizes past (1959–1960) and present (2006–2008) surveys of grasshopper communities and daily temperature records to quantify the relationship between amount and timing of warming across years and elevations, and grasshopper timing to adulthood. Grasshopper communities were surveyed at four sites, Chautauqua Mesa (1752 m), A1 (2195 m), B1 (2591 m), and C1 (3048 m), located in prairie, lower montane, upper montane, and subalpine life zones, respectively. Changes to earlier first appearance of adults depended on the degree to which a site warmed. The lowest site showed little warming and little phenological advancement. The next highest site (A1) warmed a small, but significant, amount and grasshopper species there showed inconsistent phenological advancements. The two highest sites warmed the most, and at these sites grasshoppers showed significant phenological advancements. At these sites, late-developing species showed the greatest advancements, a pattern that correlated with an increase in rate of late-season warming. The number of growing degree days (GDDs) associated with the time to adulthood for a species was unchanged across the past and present surveys, suggesting that phenological advancement depended on when a set number of GDDs is reached during a season.
Our analyses provide clear evidence that variation in amount and timing of warming over the growing season explains the vast majority of phenological variation in this system. Our results move past simple correlation and provide a stronger process-oriented and predictive framework for understanding community level phenological responses to climate change.
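The growing-degree-day mechanism described above can be illustrated with a small calculation. This sketch uses the conventional averaging method for daily GDD; the base temperature and the temperature series are placeholders, not values from the study.

```python
def daily_gdd(t_max, t_min, t_base=12.0):
    """Average-method growing degree days for one day (deg C-days).
    The 12 deg C base is an arbitrary placeholder, not the study's value."""
    return max(0.0, (t_max + t_min) / 2.0 - t_base)

def days_to_threshold(daily_temps, required_gdd, t_base=12.0):
    """Return the 1-based day on which cumulative GDD reaches a species'
    fixed requirement, or None if it is never reached that season."""
    total = 0.0
    for day, (t_max, t_min) in enumerate(daily_temps, start=1):
        total += daily_gdd(t_max, t_min, t_base)
        if total >= required_gdd:
            return day
    return None

# A warmer season reaches the same fixed GDD requirement earlier in the
# year -- the mechanism behind the observed phenological advancement.
cool = [(18.0, 6.0)] * 60   # daily mean 12 deg C: 0 GDD/day at base 12
warm = [(24.0, 12.0)] * 60  # daily mean 18 deg C: 6 GDD/day
print(days_to_threshold(cool, 30.0))  # never reached
print(days_to_threshold(warm, 30.0))  # reached after 5 days
```

Because the required GDD total for a species is treated as fixed across surveys, any shift in when that total is reached translates directly into a shift in first appearance of adults.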
Responding to the urgent need to make biodiversity records broadly accessible, the natural history community turned to “the cloud.”
Increasing the quantity and quality of data is a key goal of biodiversity informatics, leading to increased fitness for use in scientific research and beyond. This goal is impeded by a legacy of geographic locality descriptions associated with biodiversity records that are often heterogeneous and not in a map-ready format. The biodiversity informatics community has developed best practices and tools that provide the means to do retrospective georeferencing (e.g., the BioGeomancer toolkit), a process that converts heterogeneous descriptions into geographic coordinates and a measurement of spatial uncertainty. Even with these methods and tools, data publishers are faced with the immensely time-consuming task of vetting georeferenced localities. Furthermore, it is likely that overlap in georeferencing effort is occurring across data publishers. Solutions are needed that help publishers more effectively georeference their records, verify their quality, and eliminate the duplication of effort across publishers.
We have developed a tool called BioGeoBIF, which incorporates the high throughput and standardized georeferencing methods of BioGeomancer into a beginning-to-end workflow. Custodians who publish their data to the Global Biodiversity Information Facility (GBIF) can use this system to improve the quantity and quality of their georeferences. BioGeoBIF harvests records directly from the publishers' access points, georeferences the records using the BioGeomancer web-service, and makes results available to data managers for inclusion at the source. Using a web-based, password-protected, group management system for each data publisher, we leave data ownership, management, and vetting responsibilities with the managers and collaborators of each data set. We also minimize the georeferencing task, by combining and storing unique textual localities from all registered data access points, and dynamically linking that information to the password protected record information for each publisher.
We have developed one of the first examples of services that can help create higher quality data for publishers mediated through the Global Biodiversity Information Facility and its data portal. This service is one step towards solving many problems of data quality in the growing field of biodiversity informatics. We envision future improvements to our service, including faster return of results and inclusion of more georeferencing engines.
With the quantity of genomic data increasing at an exponential rate, it is imperative that these data be captured electronically, in a standard format. Standardization activities must proceed within the auspices of open-access and international working bodies. To tackle the issues surrounding the development of better descriptions of genomic investigations, we have formed the Genomic Standards Consortium (GSC). Here, we introduce the minimum information about a genome sequence (MIGS) specification with the intent of promoting participation in its development and discussing the resources that will be required to develop improved mechanisms of metadata capture and exchange. As part of its wider goals, the GSC also supports improving the ‘transparency’ of the information contained in existing genomic databases.
Ecological niche models (ENMs) provide a means of characterizing the spatial distribution of suitable conditions for species, and have recently been applied to the challenge of locating potential distributional areas at the Last Glacial Maximum (LGM) when unfavorable climate conditions led to range contractions and fragmentation. Here, we compare and contrast ENM-based reconstructions of LGM refugial locations with those resulting from the more traditional molecular genetic and phylogeographic predictions. We examined 20 North American terrestrial vertebrate species from different regions and with different range sizes for which refugia have been identified based on phylogeographic analyses, using ENM tools to make parallel predictions. We then assessed the correspondence between the two approaches based on spatial overlap and areal extent of the predicted refugia. In 14 of the 20 species, the predictions from ENM and predictions based on phylogeographic studies were significantly spatially correlated, suggesting that the two approaches to development of refugial maps are converging on a similar result. Our results confirm that ENM scenario exploration can provide a useful complement to molecular studies, offering a less subjective, spatially explicit hypothesis of past geographic patterns of distribution.
The BioGeomancer Project provides a toolkit to georeference data and specimens collected for natural history collections, a crucial task if the potential of these specimens is to be fully realized.
Biodiversity data are rapidly becoming available over the Internet in common formats that promote sharing and exchange. Currently, these data are somewhat problematic, primarily with regard to geographic and taxonomic accuracy, for use in ecological research, natural resources management and conservation decision-making. However, web-based georeferencing tools that utilize best practices and gazetteer databases can be employed to improve geographic data. Taxonomic data quality can be improved through web-enabled valid taxon names databases and services, as well as more efficient mechanisms to return systematic research results and taxonomic misidentification rates back to the biodiversity community. Both of these are under construction. A separate but related challenge will be developing web-based visualization and analysis tools for tracking biodiversity change. Our aim was to discuss how such tools, combined with data of enhanced quality, will help transform today's portals to raw biodiversity data into nexuses of collaborative creation and sharing of biodiversity knowledge.
BioGeomancer; data visualization; Geographic Information Systems; Global Biodiversity Information Facility; global biodiversity services; Google Earth; species richness estimation; survey gap analysis