Results 1-25 (63)
 

1.  The impact of taxon sampling on phylogenetic inference: a review of two decades of controversy 
Briefings in Bioinformatics  2011;13(1):122-134.
Over the past two decades, there has been a long-standing debate about the impact of taxon sampling on phylogenetic inference. Studies have been based on both real and simulated data sets, within actual and theoretical contexts, and using different inference methods, to study the impact of taxon sampling. In some cases, conflicting conclusions have been drawn for the same data set. The main questions explored in studies to date have been about the effects of using sparse data, adding new taxa, including more characters from genome sequences and using different (or concatenated) locus regions. These questions can be reduced to more fundamental ones about the assessment of data quality and the design guidelines of taxon sampling in phylogenetic inference experiments. This review summarizes progress to date in understanding the impact of taxon sampling on the accuracy of phylogenetic analysis.
doi:10.1093/bib/bbr014
PMCID: PMC3251835  PMID: 21436145
Phylogenetics; taxonomic sampling; bioinformatics
2.  Bioinformatics opportunities for identification and study of medicinal plants 
Briefings in Bioinformatics  2012;14(2):238-250.
Plants have been used as a source of medicine since historic times and several commercially important drugs are of plant-based origin. The traditional approach towards discovery of plant-based drugs often involves a significant amount of time and expenditure. These labor-intensive approaches have struggled to keep pace with the rapid development of high-throughput technologies. In the era of high volume, high-throughput data generation across the biosciences, bioinformatics plays a crucial role. This has generally been the case in the context of drug designing and discovery. However, there has been limited attention to date to the potential application of bioinformatics approaches that can leverage plant-based knowledge. Here, we review bioinformatics studies that have contributed to medicinal plants research. In particular, we highlight areas in medicinal plant research where the application of bioinformatics methodologies may result in quicker and potentially cost-effective leads toward finding plant-based remedies.
doi:10.1093/bib/bbs021
PMCID: PMC3603214  PMID: 22589384
medicinal plants; bioinformatics; drug discovery
3.  A vector space model approach to identify genetically related diseases 
Objective
The relationship between diseases and their causative genes can be complex, especially in the case of polygenic diseases. Further exacerbating the challenges in their study is that many genes may be causally related to multiple diseases. This study explored the relationship between diseases through the adaptation of an approach pioneered in the context of information retrieval: vector space models.
Materials and Methods
A vector space model approach was developed that bridges gene disease knowledge inferred across three knowledge bases: Online Mendelian Inheritance in Man, GenBank, and Medline. The approach was then used to identify potentially related diseases for two target diseases: Alzheimer disease and Prader-Willi Syndrome.
Results
In the case of both Alzheimer Disease and Prader-Willi Syndrome, a set of plausible diseases was identified that may warrant further exploration.
Discussion
This study furthers seminal work by Swanson, et al. that demonstrated the potential for mining literature for putative correlations. Using a vector space modeling approach, information from both biomedical literature and genomic resources (like GenBank) can be combined towards identification of putative correlations of interest. To this end, the relevance of the predicted diseases of interest in this study using the vector space modeling approach was validated based on supporting literature.
Conclusion
The results of this study suggest that a vector space model approach may be a useful means to identify potential relationships between complex diseases, and thereby enable the coordination of gene-based findings across multiple complex diseases.
doi:10.1136/amiajnl-2011-000480
PMCID: PMC3277619  PMID: 22227640
Translational bioinformatics; vector space modeling; information retrieval; complex diseases; knowledge representation; bioinformatics; phylogenetics
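The abstract above does not describe how the vectors were built or weighted, so the following is only a minimal sketch of the general vector space idea: each disease is a sparse vector over genes, and candidate diseases are ranked by cosine similarity to a target. The function names, the binary weights, and the placeholder labels (`GENE_A`, `candidate_1`, and so on) are illustrative assumptions, not the authors' implementation or data.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as unrelated to everything
    return dot / (norm_a * norm_b)

def rank_related(target, disease_vectors):
    """Rank every other disease by similarity to the target, highest first."""
    scores = [
        (name, cosine_similarity(disease_vectors[target], vector))
        for name, vector in disease_vectors.items()
        if name != target
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy data: placeholder disease and gene labels, binary weights.
disease_vectors = {
    "target": {"GENE_A": 1.0, "GENE_B": 1.0},
    "candidate_1": {"GENE_A": 1.0, "GENE_C": 1.0},
    "candidate_2": {"GENE_D": 1.0},
}

ranked = rank_related("target", disease_vectors)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

In practice the weights would more likely come from a term-weighting scheme than from binary presence; that choice is left open here because the abstract does not specify it.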
4.  MESHING MOLECULAR SEQUENCES AND CLINICAL TRIALS: A FEASIBILITY STUDY 
Journal of biomedical informatics  2009;43(3):442-450.
The centralized and public availability of molecular sequence and clinical trial data presents an opportunity to identify potentially valuable linkages across the bench-to-bedside “T1” translational barrier. In this study, we sought to leverage keyword metadata (Medical Subject Heading [MeSH] descriptors) to infer relationships between molecular sequences and clinical trials, as indexed by GenBank and ClinicalTrials.gov. The results of this feasibility study found that approximately 30% of sequences in GenBank could be linked to trials and over 90% of trials in ClinicalTrials.gov could be linked to sequences through MeSH descriptors. In a cursory evaluation, we were able to consistently identify meaningful linkages between molecular sequences and clinical trials. Based on our findings, there may be promise in subsequent studies aiming to identify linkages across the T1 translational barrier using existing large repositories.
doi:10.1016/j.jbi.2009.10.003
PMCID: PMC2878930  PMID: 19850150
GenBank; ClinicalTrials.gov; PubMed/MEDLINE; MeSH; bench-to-bedside; translational bioinformatics; metadata analysis
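The linkage idea in this abstract can be pictured as a set intersection over shared MeSH descriptors. The sketch below assumes each sequence and each trial has already been reduced to a set of descriptor strings; the identifiers (`SEQ1`, `TRIAL1`), the descriptor labels, and the function names are hypothetical, and the study's actual matching rules are not given in the abstract.

```python
def link_by_mesh(sequence_mesh, trial_mesh):
    """Return {sequence_id: {trial_id: shared descriptors}} for non-empty overlaps."""
    links = {}
    for seq_id, seq_terms in sequence_mesh.items():
        for trial_id, trial_terms in trial_mesh.items():
            shared = seq_terms & trial_terms
            if shared:
                links.setdefault(seq_id, {})[trial_id] = shared
    return links

def linked_fractions(sequence_mesh, trial_mesh, links):
    """Fraction of sequences and of trials that take part in at least one link."""
    linked_trials = {trial for trials in links.values() for trial in trials}
    return (
        len(links) / len(sequence_mesh),
        len(linked_trials) / len(trial_mesh),
    )

# Hypothetical toy data: placeholder identifiers and descriptor labels.
sequence_mesh = {
    "SEQ1": {"Descriptor A", "Descriptor B"},
    "SEQ2": {"Descriptor C"},
    "SEQ3": {"Descriptor D"},
}
trial_mesh = {
    "TRIAL1": {"Descriptor A"},
    "TRIAL2": {"Descriptor C", "Descriptor E"},
}

links = link_by_mesh(sequence_mesh, trial_mesh)
seq_fraction, trial_fraction = linked_fractions(sequence_mesh, trial_mesh, links)
print(links)
print(seq_fraction, trial_fraction)
```

The nested loop is quadratic in the number of records; at repository scale an inverted index from descriptor to record identifiers would be the more plausible design.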
5.  Biodiversity Informatics: the emergence of a field 
BMC Bioinformatics  2009;10(Suppl 14):S1.
doi:10.1186/1471-2105-10-S14-S1
PMCID: PMC2775146  PMID: 19900296
6.  A Retrospective Approach to Testing the DNA Barcoding Method 
PLoS ONE  2013;8(11):e77882.
A decade ago, DNA barcoding was proposed as a standardised method for identifying existing species and speeding the discovery of new species. Yet, despite its numerous successes across a range of taxa, its frequent failures have brought into question its accuracy as a short-cut taxonomic method. We use a retrospective approach, applying the method to the classification of New Zealand skinks as it stood in 1977 (primarily based upon morphological characters), and compare it to the current taxonomy reached using both morphological and molecular approaches. For the 1977 dataset, DNA barcoding had moderate-high success in identifying specimens (78-98%), and correctly flagging specimens that have since been confirmed as distinct taxa (77-100%). But most matching methods failed to detect the species complexes that were present in 1977. For the current dataset, there was moderate-high success in identifying specimens (53-99%). For both datasets, the capacity to discover new species was dependent on the methodological approach used. Species delimitation in New Zealand skinks was hindered by the absence of either a local or global barcoding gap, a result of recent speciation events and hybridisation. Whilst DNA barcoding is potentially useful for specimen identification and species discovery in New Zealand skinks, its error rate could hinder the progress of documenting biodiversity in this group. We suggest that integrated taxonomic approaches are more effective at discovering and describing biodiversity.
doi:10.1371/journal.pone.0077882
PMCID: PMC3823873  PMID: 24244283
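The "local barcoding gap" mentioned in this abstract is commonly understood as the difference, per species, between the smallest distance to any other species and the largest distance within the species. The sketch below illustrates that calculation using an uncorrected p-distance (the proportion of differing sites) on pre-aligned sequences. The distance measure, function names, and toy sequences are assumptions chosen for simplicity; the abstract does not state which distance model or matching methods the authors used.

```python
from itertools import combinations

def p_distance(a, b):
    """Proportion of differing sites between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def local_barcoding_gaps(specimens):
    """Map each species to (min interspecific distance - max intraspecific distance).

    `specimens` is a list of (species_name, aligned_sequence) pairs and must
    contain at least two species. A positive value means a gap exists.
    """
    species_names = {name for name, _ in specimens}
    if len(species_names) < 2:
        raise ValueError("at least two species are required")
    gaps = {}
    for species in species_names:
        own = [seq for name, seq in specimens if name == species]
        others = [seq for name, seq in specimens if name != species]
        # A species with a single specimen has no intraspecific distance: use 0.
        max_intra = max(
            (p_distance(a, b) for a, b in combinations(own, 2)), default=0.0
        )
        min_inter = min(p_distance(a, b) for a in own for b in others)
        gaps[species] = min_inter - max_intra
    return gaps

# Hypothetical toy alignment: placeholder species and made-up sequences.
specimens = [
    ("species_1", "ACGTACGTAC"),
    ("species_1", "ACGTACGTAT"),
    ("species_2", "ACGTTCGGAC"),
    ("species_2", "ACGTTCGGAC"),
]

gaps = local_barcoding_gaps(specimens)
print(gaps)
```

A value at or below zero for a species would indicate overlap between within-species and between-species variation, which is the situation the abstract reports for New Zealand skinks.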
7.  Origins of amyloid-β 
BMC Genomics  2013;14:290.
Background
Amyloid-β plaques are a defining characteristic of Alzheimer Disease. However, Amyloid-β deposition is also found in other forms of dementia and in non-pathological contexts. Amyloid-β deposition is variable among vertebrate species and the evolutionary emergence of the amyloidogenic property is currently unknown. Evolutionary persistence of a pathological peptide sequence may depend on the functions of the precursor gene, conservation or mutation of nucleotides or peptide domains within the precursor gene, or a species-specific physiological environment.
Results
In this study, we asked when amyloidogenic Amyloid-β first arose using phylogenetic trees constructed for the Amyloid-β Precursor Protein gene family and by modeling the potential for Amyloid-β aggregation across species in silico. We collected the most comprehensive set of sequences for the Amyloid-β Precursor Protein family using an automated, iterative meta-database search and constructed a highly resolved phylogeny. The analysis revealed that the ancestral gene for invertebrate and vertebrate Amyloid-β Precursor Protein gene families arose around metazoic speciation during the Ediacaran period. Synapomorphic frequencies found domain-specific conservation of sequence. Analyses of aggregation potential showed that potentially amyloidogenic sequences are a ubiquitous feature of vertebrate Amyloid-β Precursor Protein but are also found in echinoderm, nematode, cephalochordate, and hymenoptera species homologues.
Conclusions
The Amyloid-β Precursor Protein gene is ancient and highly conserved. The amyloid-forming Amyloid-β domains may have been present in early deuterostomes, but more recent mutations appear to have resulted in potentially unrelated amyloid-forming sequences. Our results further highlight that the species-specific physiological environment is as critical to Amyloid-β formation as the peptide sequence.
doi:10.1186/1471-2164-14-290
PMCID: PMC3660159  PMID: 23627794
Amyloid; Alzheimer disease; Phylogenetics; In silico; Aggregation; Maximum parsimony; Bayesian inference
8.  People, organizational, and leadership factors impacting informatics support for clinical and translational research 
Background
In recent years, there have been numerous initiatives undertaken to describe critical information needs related to the collection, management, analysis, and dissemination of data in support of biomedical research (J Investig Med 54:327-333, 2006); (J Am Med Inform Assoc 16:316–327, 2009); (Physiol Genomics 39:131-140, 2009); (J Am Med Inform Assoc 18:354–357, 2011). A common theme spanning such reports has been the importance of understanding and optimizing people, organizational, and leadership factors in order to achieve the promise of efficient and timely research (J Am Med Inform Assoc 15:283–289, 2008). With the emergence of clinical and translational science (CTS) as a national priority in the United States, and the corresponding growth in the scale and scope of CTS research programs, the acuity of such information needs continues to increase (JAMA 289:1278–1287, 2003); (N Engl J Med 353:1621–1623, 2005); (Sci Transl Med 3:90, 2011). At the same time, systematic evaluations of optimal people, organizational, and leadership factors that influence the provision of data, information, and knowledge management technologies and methods are notably lacking.
Methods
In response to the preceding gap in knowledge, we have conducted both: 1) a structured survey of domain experts at Academic Health Centers (AHCs); and 2) a subsequent thematic analysis of public-domain documentation provided by those same organizations. The results of these approaches were then used to identify critical factors that may influence access to informatics expertise and resources relevant to the CTS domain.
Results
A total of 31 domain experts, spanning the Biomedical Informatics (BMI), Computer Science (CS), Information Science (IS), and Information Technology (IT) disciplines participated in a structured survey process. At a high level, respondents identified notable differences in the access to BMI, CS, and IT expertise and services depending on the establishment of a formal BMI academic unit and the perceived relationship between BMI, CS, IS, and IT leaders. Subsequent thematic analysis of the aforementioned public domain documents demonstrated a discordance between perceived and reported integration across and between BMI, CS, IS, and IT programs and leaders with relevance to the CTS domain.
Conclusion
Differences in people, organization, and leadership factors do influence the effectiveness of CTS programs, particularly with regard to the ability to access and leverage BMI, CS, IS, and IT expertise and resources. Based on this finding, we believe that the development of a better understanding of how optimal BMI, CS, IS, and IT organizational structures and leadership models are designed and implemented is critical to both the advancement of CTS and ultimately, to improvements in the quality, safety, and effectiveness of healthcare.
doi:10.1186/1472-6947-13-20
PMCID: PMC3577661  PMID: 23388243
9.  Data Analysis and Data Mining: Current Issues in Biomedical Informatics 
Summary
Background
Medicine and biomedical sciences have become data-intensive fields, which, at the same time, enable the application of data-driven approaches and require sophisticated data analysis and data mining methods. Biomedical informatics provides a proper interdisciplinary context to integrate data and knowledge when processing available information, with the aim of giving effective decision-making support in clinics and translational research.
Objectives
To reflect on different perspectives related to the role of data analysis and data mining in biomedical informatics.
Methods
On the occasion of the 50th year of Methods of Information in Medicine, a symposium was organized that reflected on opportunities, challenges and priorities of organizing, representing and analysing data, information and knowledge in biomedicine and health care. The contributions of experts with a variety of backgrounds in the area of biomedical data analysis have been collected as one outcome of this symposium, in order to provide a broad, though coherent, overview of some of the most interesting aspects of the field.
Results
The paper presents sections on data accumulation and data-driven approaches in medical informatics, data and knowledge integration, statistical issues for the evaluation of data mining models, translational bioinformatics and bioinformatics aspects of genetic epidemiology.
Conclusions
Biomedical informatics represents a natural framework to properly and effectively apply data analysis and data mining methods in a decision-making context. In the future, it will be necessary to preserve the inclusive nature of the field and to foster an increasing sharing of data and methods between researchers.
doi:10.3414/ME11-06-0002
PMCID: PMC3233983  PMID: 22146916
Biomedical informatics; data mining; data analysis; data-driven methods; translational bioinformatics
10.  Enhancing Phylogeography by Improving Geographical Information from GenBank 
Journal of biomedical informatics  2011;44(Suppl 1):S44-S47.
Phylogeography is a field that focuses on the geographical lineages of species such as vertebrates or viruses. Here, geographical data, such as location of a species or viral host, is as important as the sequence information extracted from the species. Together, this information can help illustrate the migration of the species over time within a geographical area, the impact of geography over the evolutionary history, or the expected population of the species within the area. Molecular sequence data from NCBI, specifically GenBank, provide an abundance of available sequence data for phylogeography. However, geographical data is inconsistently represented and sparse across GenBank entries. This can impede analysis and, in situations where the geographical information is inferred, potentially lead to erroneous results. In this paper, we describe the current state of geographical data in GenBank, and illustrate how automated processing techniques, such as named entity recognition, can enhance the geographical data available for phylogeographic studies.
doi:10.1016/j.jbi.2011.06.005
PMCID: PMC3199023  PMID: 21723960
Phylogeography; Databases; Nucleic Acid; Geographic Locations; Bioinformatics
11.  Mining Disease Fingerprints From Within Genetic Pathways 
AMIA Annual Symposium Proceedings  2012;2012:1320-1329.
Mining biological networks can be an effective means to uncover system level knowledge out of micro level associations, such as encapsulated in genetic pathways. Analysis of human disease genetic pathways can lead to the identification of major mechanisms that may underlie disorders at an abstract functional level. The focus of this study was to develop an approach for structural pattern analysis and classification of genetic pathways of diseases. A probabilistic model was developed to capture characteristic components (‘fingerprints’) of functionally annotated pathways. A probability estimation procedure of this model searched for fingerprints in each disease pathway while improving probability estimates of model parameters. The approach was evaluated on data from the Kyoto Encyclopedia of Genes and Genomes (consisting of 56 pathways across seven disease categories). Based on the achieved average classification accuracy of up to ∼77%, the findings suggest that these fingerprints may be used for classification and discovery of genetic pathways.
PMCID: PMC3540421  PMID: 23304411
12.  Social and Behavioral History Information in Public Health Datasets 
Social and behavioral history is increasingly recognized as integral for understanding important determinants of disease and critical for patient care, research, clinical guidelines, and public health policies. Social and behavioral history information in the public health domain, specifically large public health surveys, has not been well described. In this study, a content analysis was performed and information model constructed and contrasted with clinically-based models for each of three widely used public health surveys: BRFSS (Behavioral Risk Factor Surveillance System), NHANES (National Health and Nutrition Examination Survey), and NHIS (National Health Interview Survey). Survey items were predominantly related to alcohol use, drug use, occupation, and tobacco use. Although the clinical social history information model was similar, public health social history demonstrated additional complexity in coding temporality, degree of exposure, and certainty. Our results give insight into ongoing efforts to integrate clinical and public health information resources for improving and measuring health.
PMCID: PMC3540479  PMID: 23304335
13.  Characterizing the Use and Contents of Free-Text Family History Comments in the Electronic Health Record 
The detailed collection of family history information is becoming increasingly important for patient care and biomedical research. Recent reports have highlighted the need for efforts to better understand collection and use of this information in resources such as the Electronic Health Record (EHR). This two-part study involved characterizing the use and contents of free-text comments within the family history section of an EHR. Based on a manual review of a subset of 11,456 cancer-related family history entries, 20 “reasons for use” were identified and the distribution across these reasons determined. A semi-automated analysis of the 3,358 unique comments associated with these entries was then performed to identify and quantify key categories of information. Implications of this study include guiding efforts for the improved use, collection, and subsequent analysis of family history information in the EHR.
PMCID: PMC3540518  PMID: 23304276
14.  Determining Compound Comorbidities for Heart Failure from Hospital Discharge Data 
The course of treatment and ultimate clinical outcome often depend on a holistic understanding of the patient status, which often requires cataloguing of concomitant conditions (“comorbidities”). A number of approaches have been developed to quantify the effect of comorbidities (e.g., the Charlson Comorbidity Index); however, reported metrics have been based on pair-wise analyses of co-occurring conditions. This study explored the potential to develop “compound co-morbidities” (CCMs) as a knowledge construct to represent multiple comorbidities, which accounts for relative prevalence, statistical significance, and rate of increased cost. In the context of congestive heart failure, which is a leading cause for hospital admissions nationally (particularly for the elderly), CCMs were developed and analyzed based on hospital discharge data for an entire state population (Vermont). The results suggest that CCMs may be a valuable construct for characterizing complex co-morbidity relationships that may not be captured using conventional pair-wise approaches.
PMCID: PMC3540553  PMID: 23304355
15.  Molecular Approach to the Identification of Fish in the South China Sea 
PLoS ONE  2012;7(2):e30621.
Background
DNA barcoding is one means of establishing a rapid, accurate, and cost-effective system for the identification of species. It involves the use of short, standard gene targets to create sequence profiles of known species against sequences of unknowns that can be matched and subsequently identified. The Fish Barcode of Life (FISH-BOL) campaign has the primary goal of gathering DNA barcode records for all the world's fish species. As a contribution to FISH-BOL, we examined the degree to which DNA barcoding can discriminate marine fishes from the South China Sea.
Methodology/Principal Findings
DNA barcodes of cytochrome oxidase subunit I (COI) were characterized using 1336 specimens that belong to 242 fish species from the South China Sea. All specimen provenance data (including digital specimen images and geospatial coordinates of collection localities) and collateral sequence information were assembled using the Barcode of Life Data System (BOLD; www.barcodinglife.org). Small intraspecific and large interspecific differences create distinct genetic boundaries among most species. In addition, the efficiency of two mitochondrial genes, 16S rRNA (16S) and cytochrome b (cytb), and one nuclear ribosomal gene, 18S rRNA (18S), was also evaluated for a few select groups of species.
Conclusions/Significance
The present study provides evidence for the effectiveness of DNA barcoding as a tool for monitoring marine biodiversity. Open access data of fishes from the South China Sea can benefit relative applications in ecology and taxonomy.
doi:10.1371/journal.pone.0030621
PMCID: PMC3281855  PMID: 22363454
16.  Small-Molecule Inhibition of Choline Catabolism in Pseudomonas aeruginosa and Other Aerobic Choline-Catabolizing Bacteria 
Applied and Environmental Microbiology  2011;77(13):4383-4389.
Choline is abundant in association with eukaryotes and plays roles in osmoprotection, thermoprotection, and membrane biosynthesis in many bacteria. Aerobic catabolism of choline is widespread among soil proteobacteria, particularly those associated with eukaryotes. Catabolism of choline as a carbon, nitrogen, and/or energy source may play important roles in association with eukaryotes, including pathogenesis, symbioses, and nutrient cycling. We sought to generate choline analogues to study bacterial choline catabolism in vitro and in situ. Here we report the characterization of a choline analogue, propargylcholine, which inhibits choline catabolism at the level of Dgc enzyme-catalyzed dimethylglycine demethylation in Pseudomonas aeruginosa. We used genetic analyses and 13C nuclear magnetic resonance to demonstrate that propargylcholine is catabolized to its inhibitory form, propargylmethylglycine. Chemically synthesized propargylmethylglycine was also an inhibitor of growth on choline. Bioinformatic analysis suggests that there are genes encoding DgcA homologues in a variety of proteobacteria. We examined the broader utility of propargylcholine and propargylmethylglycine by assessing growth of other members of the proteobacteria that are known to grow on choline and possess putative DgcA homologues. Propargylcholine showed utility as a growth inhibitor in P. aeruginosa but did not inhibit growth in other proteobacteria tested. In contrast, propargylmethylglycine was able to inhibit choline-dependent growth in all tested proteobacteria, including Pseudomonas mendocina, Pseudomonas fluorescens, Pseudomonas putida, Burkholderia cepacia, Burkholderia ambifaria, and Sinorhizobium meliloti. We predict that chemical inhibitors of choline catabolism will be useful for studying this pathway in clinical and environmental isolates and could be a useful tool to study proteobacterial choline catabolism in situ.
doi:10.1128/AEM.00504-11
PMCID: PMC3127689  PMID: 21602374
19.  Generation, Annotation and Analysis of First Large-Scale Expressed Sequence Tags from Developing Fiber of Gossypium barbadense L 
PLoS ONE  2011;6(7):e22758.
Background
Cotton fiber is the world's leading natural fiber used in the manufacture of textiles. Gossypium is also the model plant in the study of polyploidization, evolution, cell elongation, cell wall development, and cellulose biosynthesis. G. barbadense L. is an ideal candidate for providing new genetic variations useful to improve fiber quality for its superior properties. However, little is known about fiber development mechanisms of G. barbadense and only a few molecular resources are available in GenBank.
Methodology and Principal Findings
In total, 10,979 high-quality expressed sequence tags (ESTs) were generated from a normalized fiber cDNA library of G. barbadense. The ESTs were clustered and assembled into 5852 unigenes, consisting of 1492 contigs and 4360 singletons. The blastx result showed 2165 unigenes with significant similarity to known genes and 2687 unigenes with significant similarity to genes of predicted proteins. Functional classification revealed that unigenes were abundant in the functions of binding, catalytic activity, and metabolic pathways of carbohydrate, amino acid, energy, and lipids. The function motif/domain-related cytoskeleton and redox homeostasis were enriched. Among the 5852 unigenes, 282 and 736 unigenes were identified as potential cell wall biosynthesis and transcription factors, respectively. Furthermore, the relationships among cotton species or between cotton and other model plant systems were analyzed. Some putative species-specific unigenes of G. barbadense were highlighted.
Conclusions/Significance
The ESTs generated in this study are from the first large-scale EST project for G. barbadense and significantly enhance the number of G. barbadense ESTs in public databases. This knowledge will contribute to cotton improvements by studying fiber development mechanisms of G. barbadense, establishing a breeding program using marker-assisted selection, and discovering candidate genes related to important agronomic traits of cotton through oligonucleotide array. Our work will also provide important resources for comparative genomics, polyploidization, and genome evolution among Gossypium species.
doi:10.1371/journal.pone.0022758
PMCID: PMC3145671  PMID: 21829504
20.  The Barcode of Life Data Portal: Bridging the Biodiversity Informatics Divide for DNA Barcoding 
PLoS ONE  2011;6(7):e14689.
With the volume of molecular sequence data that is systematically being generated globally, there is a need for centralized resources for data exploration and analytics. DNA Barcode initiatives are on track to generate a compendium of molecular sequence–based signatures for identifying animals and plants. To date, the range of available data exploration and analytic tools to explore these data has only been available in a boutique form, often representing a frustrating hurdle for many researchers who may not necessarily have resources to install or implement algorithms described by the analytic community. The Barcode of Life Data Portal (BDP) is a first step towards integrating the latest biodiversity informatics innovations with molecular sequence data from DNA barcoding. Through establishment of community driven standards, based on discussion with the Data Analysis Working Group (DAWG) of the Consortium for the Barcode of Life (CBOL), the BDP provides an infrastructure for incorporation of existing and next-generation DNA barcode analytic applications in an open forum.
doi:10.1371/journal.pone.0014689
PMCID: PMC3144886  PMID: 21818249
21.  Translational bioinformatics: linking knowledge across biological and clinical realms 
Nearly a decade since the completion of the first draft of the human genome, the biomedical community is positioned to usher in a new era of scientific inquiry that links fundamental biological insights with clinical knowledge. Accordingly, holistic approaches are needed to develop and assess hypotheses that incorporate genotypic, phenotypic, and environmental knowledge. This perspective presents translational bioinformatics as a discipline that builds on the successes of bioinformatics and health informatics for the study of complex diseases. The early successes of translational bioinformatics are indicative of the potential to achieve the promise of the Human Genome Project for gaining deeper insights to the genetic underpinnings of disease and progress toward the development of a new generation of therapies.
doi:10.1136/amiajnl-2011-000245
PMCID: PMC3128415  PMID: 21561873
Translational bioinformatics; systems medicine; systems biology; bioinformatics; biomedical informatics; knowledge representation; information retrieval; phylogenetics; modeling physiologic and disease processes; linking the genotype and phenotype; identifying genome and protein structure and function; visualization of data and knowledge; simulation of complex systems (at all levels: molecules to work groups to organizations); knowledge representations; uncertain reasoning and decision theory; languages; computational methods; statistical analysis of large datasets; advanced algorithms; discovery; text and data mining methods; natural-language processing; automated learning; ontologies
22.  Evaluation of family history information within clinical documents and adequacy of HL7 clinical statement and clinical genomics family history models for its representation: a case report 
Family history information has emerged as an increasingly important tool for clinical care and research. While recent standards provide for structured entry of family history, many clinicians record family history data in text. The authors sought to characterize family history information within clinical documents to assess the adequacy of existing models and create a more comprehensive model for its representation. Models were evaluated on 100 documents containing 238 sentences and 410 statements relevant to family history. Most statements were of family member plus disease or of disease only. Statement coverage was 91%, 77%, and 95% for HL7 Clinical Genomics Family History Model, HL7 Clinical Statement Model, and the newly created Merged Family History Model, respectively. Negation (18%) and inexact family member specification (9.5%) occurred commonly. Overall, both HL7 models could represent most family history statements in clinical reports; however, refinements are needed to represent the full breadth of family history data.
doi:10.1136/jamia.2009.002238
PMCID: PMC2995709  PMID: 20442153
23.  Towards Structuring Unstructured GenBank Metadata for Enhancing Comparative Biological Studies 
Within large sequence repositories such as GenBank there is a wealth of metadata providing contextual information that may enhance search and retrieval of relevant sequences for a range of subsequent analyses. One challenge is the use of free-text in these metadata fields where approaches are needed to extract, structure, and encode essential information. The goal of the present study was to explore the feasibility of using a combination of existing resources for annotating unstructured GenBank metadata, initially focusing on the “host” and “isolation_source” fields. This paper summarizes early results for 10 host organisms that include a characterization of associated isolation sources with respect to biomedical ontologies and semantic types. The findings from this preliminary study provide insights to the rich amount of information captured within these unstructured metadata, guidance for addressing the challenges and issues encountered, and highlight the potential value for enriching comparative biological studies towards improving human health.
PMCID: PMC3248757  PMID: 22211174
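As a rough illustration of the first step such a study would need, the sketch below pulls the free-text `host` and `isolation_source` qualifiers out of a GenBank flat-file fragment, where qualifiers appear as `/name="value"` and long values may wrap across lines. The record fragment is invented for the example, and the function name is hypothetical; mapping the extracted text to ontologies or semantic types, which is the focus of the abstract, is not attempted here.

```python
import re

# Matches /host="..." and /isolation_source="..." qualifiers; the negated
# character class also spans line breaks, so wrapped values are captured.
QUALIFIER = re.compile(r'/(host|isolation_source)="([^"]*)"')

def extract_qualifiers(record_text):
    """Return {qualifier_name: value} with wrapped whitespace collapsed."""
    found = {}
    for name, value in QUALIFIER.findall(record_text):
        found[name] = " ".join(value.split())
    return found

# Invented fragment in the style of a GenBank "source" feature.
record_fragment = '''
     source          1..100
                     /organism="Example virus"
                     /host="Homo sapiens; male"
                     /isolation_source="nasal
                     swab"
'''

qualifiers = extract_qualifiers(record_fragment)
print(qualifiers)
```

For real records, an established parser is likely a safer choice than a regular expression, since values can contain escaped quotes and other edge cases this sketch ignores.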
24.  A Multi-Site Content Analysis of Social History Information in Clinical Notes 
Within Electronic Health Records (EHRs), the social history section contains information relevant to social, behavioral, and environmental determinants of health. While social history is playing an increasingly important role in patient care, biomedical research, and public health, little analysis has been done to describe content in the EHR or the adequacy of existing standards for representing this information. In this study, social history sections from 260 clinical notes containing 989 sentences and 1,439 statements were analyzed from three sources. In total, 35 statement types were identified along with categories of information within statements for each type. For the 8 most common types, HL7 CDA and openEHR were found to provide different representations capable of capturing the breadth and granularity of information to some extent. The results of this study provide valuable insights for guiding efforts in the enhanced collection, standardization, and use of social history information in the EHR.
PMCID: PMC3243209  PMID: 22195074
25.  GENESTRACE: PHENOMIC KNOWLEDGE DISCOVERY VIA STRUCTURED TERMINOLOGY 
The era of applied genomic medicine is quickly approaching, accompanied by the increasing availability of detailed genetic information. Understanding the genetic etiology behind complex, multi-gene diseases remains an important challenge. In order to uncover the putative genetic etiology of complex diseases, we designed a method that explores the relationships between two major terminological and ontological resources: the Unified Medical Language System (UMLS) and the Gene Ontology (GO). The UMLS has a mainly clinical emphasis; Gene Ontology (GO) has become the standard for biological annotations of genes and gene products. Using statistical and semantic relationships within and between the two resources, we are able to infer relationships between disease concepts in the UMLS and gene products annotated using GO and its associated databases. We validated our inferences by comparing them to the known gene-disease relationships, as defined in the Online Mendelian Inheritance in Man's morbidmap. The proof-of-concept methods presented here are unique in that they bypass the ambiguity of the direct extraction of gene or disease terms from MEDLINE. Additionally, our methods provide direct links to clinically significant diseases through established terminologies or ontologies. The preliminary results presented here indicate the potential utility of exploiting the existing, manually curated relationships in biomedical resources as a tool for the discovery of potentially valuable new gene-disease relationships. The GenesTrace system may be accessed at the following URL: http://phene.cpmc.columbia.edu:8080/genesTrace/index.jsp
PMCID: PMC2894422  PMID: 15759618
