
Results 1-5 (5)

1.  Applications of Natural Language Processing in Biodiversity Science 
Advances in Bioinformatics  2012;2012:391574.
Centuries of biological knowledge are contained in the massive body of scientific literature, written for human-readability but too big for any one person to consume. Large-scale mining of information from the literature is necessary if biology is to transform into a data-driven science. A computer can handle the volume but cannot make sense of the language. This paper reviews and discusses the use of natural language processing (NLP) and machine-learning algorithms to extract information from systematic literature. NLP algorithms have been used for decades, but require special development for application in the biological realm due to the special nature of the language. Many tools exist for biological information extraction (cellular processes, taxonomic names, and morphological characters), but none have been applied life wide and most still require testing and development. Progress has been made in developing algorithms for automated annotation of taxonomic text, identification of taxonomic names in text, and extraction of morphological character information from taxonomic descriptions. This manuscript will briefly discuss the key steps in applying information extraction tools to enhance biodiversity science.
PMCID: PMC3364545  PMID: 22685456
2.  Data issues in the life sciences 
ZooKeys  2011;15-51.
We review technical and sociological issues facing the Life Sciences as they transform into more data-centric disciplines - the “Big New Biology”. Three major challenges are: 1) lack of comprehensive standards; 2) lack of incentives for individual scientists to share data; 3) lack of appropriate infrastructure and support. Technological advances with standards, bandwidth, distributed computing, exemplar successes, and a strong presence in the emerging world of Linked Open Data are sufficient to conclude that technical issues will be overcome in the foreseeable future. While motivated to have a shared open infrastructure and data pool, and pressured by funding agencies to move in this direction, the sociological issues determine progress. Major sociological issues include our lack of understanding of the heterogeneous data cultures within Life Sciences, and the impediments to progress include a lack of incentives to build appropriate infrastructures into projects and institutions or to encourage scientists to make data openly available.
PMCID: PMC3234430  PMID: 22207805
life science; informatics; data issues; standards; incentives; escience
3.  A statistical assessment of population trends for data deficient Mexican amphibians 
PeerJ  2014;2:e703.
Background. Mexico has the world’s fifth largest population of amphibians and is the country with the second-highest number of threatened amphibian species. About 10% of Mexican amphibians lack enough data to be assigned to a risk category by the IUCN, so in this paper we test a statistical tool that, in the absence of specific demographic data, can assess a species’ risk of extinction and population trend, and can help identify which variables increase its vulnerability. Recent studies have demonstrated that the risk of species decline depends on extrinsic and intrinsic traits, so including both when assessing extinction risk might yield more accurate assessments of threats.
Methods. We harvested data from the Encyclopedia of Life (EOL) and the published literature for Mexican amphibians, and used these data to assess the population trend of some of the Mexican species that have been assigned to the Data Deficient category of the IUCN using Random Forests, a Machine Learning method that gives a prediction of complex processes and identifies the most important variables that account for the predictions.
Results. Our results show that most of the data deficient Mexican amphibians that we used have decreasing population trends. We found that Random Forests is a solid way to identify species with decreasing population trends when no demographic data are available. Moreover, we point to the most important variables that make species more vulnerable to extinction. This exercise is a very valuable first step in assigning conservation priorities for poorly known species.
PMCID: PMC4273930  PMID: 25548736
Mexican amphibians; Statistical assessment; Encyclopedia of life; Random forests; Data harvest; IUCN categories
4.  Knowledge Extraction and Semantic Annotation of Text from the Encyclopedia of Life 
PLoS ONE  2014;9(3):e89550.
Numerous digitization and ontological initiatives have focused on translating biological knowledge from narrative text to machine-readable formats. In this paper, we describe two workflows for knowledge extraction and semantic annotation of text data objects featured in an online biodiversity aggregator, the Encyclopedia of Life. One workflow tags text with DBpedia URIs based on keywords. Another workflow finds taxon names in text using GNRD for the purpose of building a species association network. Both workflows work well: the annotation workflow has an F1 Score of 0.941 and the association algorithm has an F1 Score of 0.885. Existing text annotators such as Terminizer and DBpedia Spotlight performed well, but require some optimization to be useful in the ecology and evolution domain. Important future work includes scaling up and improving accuracy through the use of distributional semantics.
PMCID: PMC3940440  PMID: 24594988
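Note on the F1 Scores quoted in entry 4: an F1 score is the harmonic mean of precision and recall. The sketch below shows that calculation only; the function name and the counts are hypothetical and are not taken from the paper, which does not report its underlying counts in this abstract.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Return the harmonic mean of precision and recall (0.0 when undefined)."""
    predicted = true_positives + false_positives   # everything the annotator tagged
    actual = true_positives + false_negatives      # everything it should have tagged
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only (not data from the paper).
example = f1_score(true_positives=80, false_positives=10, false_negatives=10)
```

Because the harmonic mean is pulled toward the smaller of the two values, a high F1 score requires both precision and recall to be high.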
5.  The Taxonomic Significance of Species That Have Only Been Observed Once: The Genus Gymnodinium (Dinoflagellata) as an Example 
PLoS ONE  2012;7(8):e44015.
Taxonomists have been tasked with cataloguing and quantifying the Earth’s biodiversity. Their progress is measured in code-compliant species descriptions that include text, images, type material and molecular sequences. It is from this material that other researchers are to identify individuals of the same species in future observations. It has been estimated that 13% to 22% (depending on taxonomic group) of described species have only ever been observed once. Species that have only been observed at the time and place of their original description are referred to as oncers. Oncers are important to our current understanding of biodiversity. They may be validly described species that are members of a rare biosphere, or they may indicate endemism, or that these species are limited to very constrained niches. Alternatively, they may reflect that taxonomic practices are too poor to allow the organism to be re-identified or that the descriptions are unknown to other researchers. If the latter are true, our current tally of species will not be an accurate indication of what we know. In order to investigate this phenomenon and its potential causes, we examined the microbial eukaryote genus Gymnodinium. This genus contains 268 extant species, 103 (38%) of which have not been observed since their original description. We report traits of the original descriptions and interpret them in respect to the status of the species. We conclude that the majority of oncers were poorly described and their identity is ambiguous. As a result, we argue that the genus Gymnodinium contains only 234 identifiable species. Species that have been observed multiple times tend to have longer descriptions, written in English. The styles of individual authors have a major effect, with a few authors describing a disproportionate number of oncers. 
The information about the taxonomy of Gymnodinium that is available via the internet is incomplete, and reliance on it will not give access to all necessary knowledge. Six new names are presented – Gymnodinium campbelli for the homonymous name Gymnodinium translucens Campbell 1973, Gymnodinium antarcticum for the homonymous name Gymnodinium frigidum Balech 1965, Gymnodinium manchuriensis for the homonymous name Gymnodinium autumnale Skvortzov 1968, Gymnodinium christenum for the homonymous name Gymnodinium irregulare Christen 1959, Gymnodinium conkufferi for the homonymous name Gymnodinium irregulare Conrad & Kufferath 1954 and Gymnodinium chinensis for the homonymous name Gymnodinium frigidum Skvortzov 1968.
PMCID: PMC3431360  PMID: 22952856
