1.  Phenoscape: Identifying Candidate Genes for Evolutionary Phenotypes 
Phenotypes resulting from mutations in genetic model organisms can help reveal candidate genes for evolutionarily important phenotypic changes in related taxa. Although testing candidate gene hypotheses experimentally in nonmodel organisms is typically difficult, ontology-driven information systems can help generate testable hypotheses about developmental processes in experimentally tractable organisms. Here, we tested candidate gene hypotheses suggested by expert use of the Phenoscape Knowledgebase, specifically looking for genes that are candidates responsible for evolutionarily interesting phenotypes in the ostariophysan fishes that bear resemblance to mutant phenotypes in zebrafish. For this, we searched ZFIN for genetic perturbations that result in either loss of basihyal element or loss of scales phenotypes, because these are the ancestral phenotypes observed in catfishes (Siluriformes). We tested the identified candidate genes by examining their endogenous expression patterns in the channel catfish, Ictalurus punctatus. The experimental results were consistent with the hypotheses that these features evolved through disruption in developmental pathways at, or upstream of, brpf1 and eda/edar for the ancestral losses of basihyal element and scales, respectively. These results demonstrate that ontological annotations of the phenotypic effects of genetic alterations in model organisms, when aggregated within a knowledgebase, can be used effectively to generate testable, and useful, hypotheses about evolutionary changes in morphology.
PMCID: PMC4693980  PMID: 26500251
molecular evolution; gene expression; evolutionary phenotypes; catfish; nonmodel organism
2.  Annotation of phenotypic diversity: decoupling data curation and ontology curation using Phenex 
Phenex is a desktop application for semantically annotating the phenotypic character matrix datasets common in evolutionary biology. Since its initial publication, we have added new features that address several major bottlenecks in the efficiency of the phenotype curation process: allowing curators during the data curation phase to provisionally request terms that are not yet available from a relevant ontology; supporting quality control against annotation guidelines to reduce later manual review and revision; and enabling the sharing of files for collaboration among curators.
We decoupled data annotation from ontology development by creating an Ontology Request Broker (ORB) within Phenex. Curators can use the ORB to request a provisional term for use in data annotation; the provisional term can be automatically replaced with a permanent identifier once the term is added to an ontology. We added a set of annotation consistency checks to prevent common curation errors, reducing the need for later correction. We facilitated collaborative editing by improving the reliability of Phenex when used with online folder sharing services, via file change monitoring and continual autosave.
With the addition of these new features, and in particular the Ontology Request Broker, Phenex users have been able to focus more effectively on data annotation. Phenoscape curators using Phenex have reported a smoother annotation workflow, with much reduced interruptions from ontology maintenance and file management issues.
PMCID: PMC4236444  PMID: 25411634
Annotation; Phenotypes; Ontology; Curation; Systematics; Character matrix
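The provisional-term workflow described in the Phenex abstract above can be illustrated with a minimal sketch. It is not Phenex or ORB code: the `Annotation` class, the `resolve_provisional` function, and every identifier shown (`PROVISIONAL:001`, `EXAMPLE:...`) are hypothetical and chosen only to show the idea of swapping a provisional identifier for a permanent one once the term exists in an ontology.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    character: str  # free-text character being annotated
    term_id: str    # ontology term identifier, possibly provisional


def resolve_provisional(annotations, resolved):
    """Return new annotations with provisional IDs swapped for permanent ones.

    `resolved` maps a provisional ID to its permanent ID. IDs that are not
    in the map (already permanent, or still pending) are left untouched.
    """
    return [
        Annotation(a.character, resolved.get(a.term_id, a.term_id))
        for a in annotations
    ]


# Hypothetical data: one annotation uses a provisional term, one does not.
annotations = [
    Annotation("basihyal shape", "PROVISIONAL:001"),
    Annotation("scale presence", "EXAMPLE:0000123"),
]
resolved = {"PROVISIONAL:001": "EXAMPLE:0000456"}
updated = resolve_provisional(annotations, resolved)
```

Returning a new list rather than editing in place keeps the original annotations available for review, which seems in keeping with the quality-control emphasis of the abstract, though that design choice is an assumption here.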
3.  Open data for evolutionary synthesis: an introduction to the NESCent collection 
Scientific Data  2014;1:140030.
PMCID: PMC4322567  PMID: 25977787
4.  The Standing Pool of Genomic Structural Variation in a Natural Population of Mimulus guttatus 
Genome Biology and Evolution  2013;6(1):53-64.
Major unresolved questions in evolutionary genetics include determining the contributions of different mutational sources to the total pool of genetic variation in a species, and understanding how these different forms of genetic variation interact with natural selection. Recent work has shown that structural variants (SVs) (insertions, deletions, inversions, and transpositions) are a major source of genetic variation, often outnumbering single nucleotide variants in terms of total bases affected. Despite the near ubiquity of SVs, major questions about their interaction with natural selection remain. For example, how does the allele frequency spectrum of SVs differ when compared with single nucleotide variants? How often do SVs affect genes, and what are the consequences? To begin to address these questions, we have systematically identified and characterized a large set of submicroscopic insertion and deletion (indel) variants (between 1 and 200 kb in length) among ten inbred lines from a single natural population of the plant species Mimulus guttatus. After extensive computational filtering, we focused on a set of 4,142 high-confidence indels that showed an experimental validation rate of 73%. All but one of these indels were less than 200 kb. Although the largest were generally at lower frequencies in the population, a surprising number of large indels are at intermediate frequencies. Although indels overlapping with genes were much rarer than expected by chance, approximately 600 genes were affected by an indel. Nucleotide-binding site leucine-rich repeat (NBS–LRR) defense response genes were the most enriched among the gene families affected. Most indels associated with genes were rare and appeared to be under purifying selection, though we do find four high-frequency derived insertion alleles that show signatures of recent positive selection.
PMCID: PMC3914686  PMID: 24336482
indel; Mimulus guttatus; natural selection; population genomics; structural variation
5.  The vertebrate taxonomy ontology: a framework for reasoning across model organism and species phenotypes 
A hierarchical taxonomy of organisms is a prerequisite for semantic integration of biodiversity data. Ideally, there would be a single, expansive, authoritative taxonomy that includes extinct and extant taxa, information on synonyms and common names, and monophyletic supraspecific taxa that reflect our current understanding of phylogenetic relationships.
As a step towards development of such a resource, and to enable large-scale integration of phenotypic data across vertebrates, we created the Vertebrate Taxonomy Ontology (VTO), a semantically defined taxonomic resource derived from the integration of existing taxonomic compilations, and freely distributed under a Creative Commons Zero (CC0) public domain waiver. The VTO includes both extant and extinct vertebrates and currently contains 106,947 taxonomic terms, 22 taxonomic ranks, 104,736 synonyms, and 162,400 cross-references to other taxonomic resources. Key challenges in constructing the VTO included (1) extracting and merging names, synonyms, and identifiers from heterogeneous sources; (2) structuring hierarchies of terms based on evolutionary relationships and the principle of monophyly; and (3) automating this process as much as possible to accommodate updates in source taxonomies.
The VTO is the primary source of taxonomic information used by the Phenoscape Knowledgebase, which integrates genetic and evolutionary phenotype data across both model and non-model vertebrates. The VTO is useful for inferring phenotypic changes on the vertebrate tree of life, which enables queries for candidate genes for various episodes in vertebrate evolution.
PMCID: PMC4177199  PMID: 24267744
Data integration; Evolutionary biology; Paleontology; Taxonomic rank
6.  Data reuse and the open data citation advantage 
PeerJ  2013;1:e175.
Background. Attribution to the original contributor upon reuse of published data is important both as a reward for data creators and to document the provenance of research findings. Previous studies have found that papers with publicly available datasets receive a higher number of citations than similar studies without available data. However, few previous analyses have had the statistical power to control for the many variables known to predict citation rate, which has led to uncertain estimates of the “citation benefit”. Furthermore, little is known about patterns in data reuse over time and across datasets.
Method and Results. Here, we look at citation rates while controlling for many known citation predictors and investigate the variability of data reuse. In a multivariate regression on 10,555 studies that created gene expression microarray data, we found that studies that made data available in a public repository received 9% (95% confidence interval: 5% to 13%) more citations than similar studies for which the data was not made available. Date of publication, journal impact factor, open access status, number of authors, first and last author publication history, corresponding author country, institution citation history, and study topic were included as covariates. The citation benefit varied with date of dataset deposition: a citation benefit was most clear for papers published in 2004 and 2005, at about 30%. Authors published most papers using their own datasets within two years of their first publication on the dataset, whereas data reuse papers published by third-party investigators continued to accumulate for at least six years. To study patterns of data reuse directly, we compiled 9,724 instances of third party data reuse via mention of GEO or ArrayExpress accession numbers in the full text of papers. The level of third-party data use was high: for 100 datasets deposited in year 0, we estimated that 40 papers in PubMed reused a dataset by year 2, 100 by year 4, and more than 150 data reuse papers had been published by year 5. Data reuse was distributed across a broad base of datasets: a very conservative estimate found that 20% of the datasets deposited between 2003 and 2007 had been reused at least once by third parties.
Conclusion. After accounting for other factors affecting citation rate, we find a robust citation benefit from open data, although a smaller one than previously reported. We conclude there is a direct effect of third-party data reuse that persists for years beyond the time when researchers have published most of the papers reusing their own data. Other factors that may also contribute to the citation benefit are considered. We further conclude that, at least for gene expression microarray data, a substantial fraction of archived datasets are reused, and that the intensity of dataset reuse has been steadily increasing since 2003.
PMCID: PMC3792178  PMID: 24109559
Data reuse; Data repositories; Gene expression microarray; Incentives; Data archiving; Open data; Bibliometrics; Information science
7.  500,000 fish phenotypes: The new informatics landscape for evolutionary and developmental biology of the vertebrate skeleton 
The rich phenotypic diversity that characterizes the vertebrate skeleton results from evolutionary changes in regulation of genes that drive development. Although relatively little is known about the genes that underlie the skeletal variation among fish species, significant knowledge of genetics and development is available for zebrafish. Because developmental processes are highly conserved, this knowledge can be leveraged for understanding the evolution of skeletal diversity. We developed the Phenoscape Knowledgebase (KB) to yield testable hypotheses of candidate genes involved in skeletal evolution. We developed a community anatomy ontology for fishes and ontology-based methods to represent complex free-text character descriptions of species in a computable format. With these tools, we populated the KB with comparative morphological data from the literature on over 2,500 teleost fishes (mainly Ostariophysi) resulting in over 500,000 taxon phenotype annotations. The KB integrates these data with similarly structured phenotype data from zebrafish genes. Using ontology-based reasoning, candidate genes can be inferred for the phenotypes that vary across taxa, thereby uniting genetic and phenotypic data to formulate evo-devo hypotheses. The morphological data in the KB can be browsed, sorted, and aggregated in ways that provide unprecedented possibilities for data mining and discovery.
PMCID: PMC3377363  PMID: 22736877
8.  Science Incubators: Synthesis Centers and Their Role in the Research Ecosystem 
PLoS Biology  2013;11(1):e1001468.
How should funding agencies enable researchers to explore high-risk but potentially high-reward science? One model that appears to work is the NSF-funded synthesis center, an incubator for community-led, innovative science.
PMCID: PMC3545866  PMID: 23335860
9.  Genome-Scale Phylogenetics: Inferring the Plant Tree of Life from 18,896 Gene Trees 
Systematic Biology  2010;60(2):117-125.
Phylogenetic analyses using genome-scale data sets must confront incongruence among gene trees, which in plants is exacerbated by frequent gene duplications and losses. Gene tree parsimony (GTP) is a phylogenetic optimization criterion in which a species tree that minimizes the number of gene duplications induced among a set of gene trees is selected. The run time performance of previous implementations has limited its use on large-scale data sets. We used new software that incorporates recent algorithmic advances to examine the performance of GTP on a plant data set consisting of 18,896 gene trees containing 510,922 protein sequences from 136 plant taxa (giving a combined alignment length of >2.9 million characters). The relationships inferred from the GTP analysis were largely consistent with previous large-scale studies of backbone plant phylogeny and resolved some controversial nodes. The placement of taxa that were present in few gene trees generally varied the most among GTP bootstrap replicates. Excluding these taxa either before or after the GTP analysis revealed high levels of phylogenetic support across plants. The analyses supported magnoliids sister to a eudicot + monocot clade and did not support the eurosid I and II clades. This study presents a nuclear genomic perspective on the broad-scale phylogenic relationships among plants, and it demonstrates that nuclear genes with a history of duplication and loss can be phylogenetically informative for resolving the plant tree of life.
PMCID: PMC3038350  PMID: 21186249
Gene tree–species tree reconciliation; gene tree parsimony; plant phylogeny; phylogenomics
10.  Taking the First Steps towards a Standard for Reporting on Phylogenies: Minimal Information about a Phylogenetic Analysis (MIAPA) 
In the eight years since phylogenomics was introduced as the intersection of genomics and phylogenetics, the field has provided fundamental insights into gene function, genome history and organismal relationships. The utility of phylogenomics is growing with the increase in the number and diversity of taxa for which whole genome and large transcriptome sequence sets are being generated. We assert that the synergy between genomic and phylogenetic perspectives in comparative biology would be enhanced by the development and refinement of minimal reporting standards for phylogenetic analyses. Encouraged by the development of the Minimum Information About a Microarray Experiment (MIAME) standard, we propose a similar roadmap for the development of a Minimal Information About a Phylogenetic Analysis (MIAPA) standard. Key in the successful development and implementation of such a standard will be broad participation by developers of phylogenetic analysis software, phylogenetic database developers, practitioners of phylogenomics, and journal editors.
PMCID: PMC3167193  PMID: 16901231
11.  The iPlant Collaborative: Cyberinfrastructure for Plant Biology 
The iPlant Collaborative (iPlant) is a United States National Science Foundation (NSF) funded project that aims to create an innovative, comprehensive, and foundational cyberinfrastructure in support of plant biology research (PSCIC, 2006). iPlant is developing cyberinfrastructure that uniquely enables scientists throughout the diverse fields that comprise plant biology to address Grand Challenges in new ways, to stimulate and facilitate cross-disciplinary research, to promote biology and computer science research interactions, and to train the next generation of scientists on the use of cyberinfrastructure in research and education. Meeting humanity's projected demands for agricultural and forest products and the expectation that natural ecosystems be managed sustainably will require synergies from the application of information technologies. The iPlant cyberinfrastructure design is based on an unprecedented period of research community input, and leverages developments in high-performance computing, data storage, and cyberinfrastructure for the physical sciences. iPlant is an open-source project with application programming interfaces that allow the community to extend the infrastructure to meet its needs. iPlant is sponsoring community-driven workshops addressing specific scientific questions via analysis tool integration and hypothesis testing. These workshops teach researchers how to add bioinformatics tools and/or datasets into the iPlant cyberinfrastructure enabling plant scientists to perform complex analyses on large datasets without the need to master the command-line or high-performance computational services.
PMCID: PMC3355756  PMID: 22645531
cyberinfrastructure; bioinformatics; plant biology; computational biology
12.  Evolutionary Characters, Phenotypes and Ontologies: Curating Data from the Systematic Biology Literature 
PLoS ONE  2010;5(5):e10708.
The wealth of phenotypic descriptions documented in the published articles, monographs, and dissertations of phylogenetic systematics is traditionally reported in a free-text format, and it is therefore largely inaccessible for linkage to biological databases for genetics, development, and phenotypes, and difficult to manage for large-scale integrative work. The Phenoscape project aims to represent these complex and detailed descriptions with rich and formal semantics that are amenable to computation and integration with phenotype data from other fields of biology. This entails reconceptualizing the traditional free-text characters into the computable Entity-Quality (EQ) formalism using ontologies.
Methodology/Principal Findings
We used ontologies and the EQ formalism to curate a collection of 47 phylogenetic studies on ostariophysan fishes (including catfishes, characins, minnows, knifefishes) and their relatives with the goal of integrating these complex phenotype descriptions with information from an existing model organism database (zebrafish). We developed a curation workflow for the collection of character, taxonomic and specimen data from these publications. A total of 4,617 phenotypic characters (10,512 states) for 3,449 taxa, primarily species, were curated into EQ formalism (for a total of 12,861 EQ statements) using anatomical and taxonomic terms from teleost-specific ontologies (Teleost Anatomy Ontology and Teleost Taxonomy Ontology) in combination with terms from a quality ontology (Phenotype and Trait Ontology). Standards and guidelines for consistently and accurately representing phenotypes were developed in response to the challenges that were evident from two annotation experiments and from feedback from curators.
The challenges we encountered and many of the curation standards and methods for improving consistency that we developed are generally applicable to any effort to represent phenotypes using ontologies. This is because an ontological representation of the detailed variations in phenotype, whether between mutant or wildtype, among individual humans, or across the diversity of species, requires a process by which a precise combination of terms from domain ontologies are selected and organized according to logical relations. The efficiencies that we have developed in this process will be useful for any attempt to annotate complex phenotypic descriptions using ontologies. We also discuss some ramifications of EQ representation for the domain of systematics.
PMCID: PMC2873956  PMID: 20505755
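As an illustration of the Entity-Quality (EQ) formalism mentioned in the abstract above, the following sketch models one EQ statement as plain data. It is a simplified assumption about how such a statement might be represented, not the Phenoscape data model: the `Term` and `EQStatement` classes are hypothetical, and the identifiers are explicit placeholders rather than real ontology IDs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    ontology: str  # which kind of ontology the term comes from
    term_id: str   # placeholder identifier, NOT a real ontology ID
    label: str     # human-readable name


@dataclass(frozen=True)
class EQStatement:
    taxon: Term    # the taxon showing the phenotype
    entity: Term   # anatomical entity (the "E")
    quality: Term  # quality of that entity (the "Q")

    def describe(self):
        """Render the statement as a short readable phrase."""
        return f"{self.taxon.label}: {self.entity.label} {self.quality.label}"


# Hypothetical statement for a free-text character such as
# "basihyal element absent" scored for a catfish taxon.
stmt = EQStatement(
    taxon=Term("taxonomy", "TAXON:PLACEHOLDER", "Siluriformes"),
    entity=Term("anatomy", "ANATOMY:PLACEHOLDER", "basihyal element"),
    quality=Term("quality", "QUALITY:PLACEHOLDER", "absent"),
)
print(stmt.describe())
```

Keeping entity and quality as separate terms is what allows a statement like this to be compared with similarly structured phenotype data from a model organism database.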
13.  Phenex: Ontological Annotation of Phenotypic Diversity 
PLoS ONE  2010;5(5):e10500.
Phenotypic differences among species have long been systematically itemized and described by biologists in the process of investigating phylogenetic relationships and trait evolution. Traditionally, these descriptions have been expressed in natural language within the context of individual journal publications or monographs. As such, this rich store of phenotype data has been largely unavailable for statistical and computational comparisons across studies or integration with other biological knowledge.
Methodology/Principal Findings
Here we describe Phenex, a platform-independent desktop application designed to facilitate efficient and consistent annotation of phenotypic similarities and differences using Entity-Quality syntax, drawing on terms from community ontologies for anatomical entities, phenotypic qualities, and taxonomic names. Phenex can be configured to load only those ontologies pertinent to a taxonomic group of interest. The graphical user interface was optimized for evolutionary biologists accustomed to working with lists of taxa, characters, character states, and character-by-taxon matrices.
Annotation of phenotypic data using ontologies and globally unique taxonomic identifiers will allow biologists to integrate phenotypic data from different organisms and studies, leveraging decades of work in systematics and comparative morphology.
PMCID: PMC2864769  PMID: 20463926
14.  The Teleost Anatomy Ontology: Anatomical Representation for the Genomics Age 
Systematic Biology  2010;59(4):369-383.
The rich knowledge of morphological variation among organisms reported in the systematic literature has remained in free-text format, impractical for use in large-scale synthetic phylogenetic work. This noncomputable format has also precluded linkage to the large knowledgebase of genomic, genetic, developmental, and phenotype data in model organism databases. We have undertaken an effort to prototype a curated, ontology-based evolutionary morphology database that maps to these genetic databases to facilitate investigation into the mechanistic basis and evolution of phenotypic diversity. Among the first requirements in establishing this database was the development of a multispecies anatomy ontology with the goal of capturing anatomical data in a systematic and computable manner. An ontology is a formal representation of a set of concepts with defined relationships between those concepts. Multispecies anatomy ontologies in particular are an efficient way to represent the diversity of morphological structures in a clade of organisms, but they present challenges in their development relative to single-species anatomy ontologies. Here, we describe the Teleost Anatomy Ontology (TAO), a multispecies anatomy ontology for teleost fishes derived from the Zebrafish Anatomical Ontology (ZFA) for the purpose of annotating varying morphological features across species. To facilitate interoperability with other anatomy ontologies, TAO uses the Common Anatomy Reference Ontology as a template for its upper level nodes, and TAO and ZFA are synchronized, with zebrafish terms specified as subtypes of teleost terms. We found that the details of ontology architecture have ramifications for querying, and we present general challenges in developing a multispecies anatomy ontology, including refinement of definitions, taxon-specific relationships among terms, and representation of taxonomically variable developmental pathways.
PMCID: PMC2885267  PMID: 20547776
Bioinformatics; devo-evo; fish; morphology; ontology; Teleostei
15.  A hierarchical model for incomplete alignments in phylogenetic inference 
Bioinformatics  2009;25(5):592-598.
Motivation: Full-length DNA and protein sequences that span the entire length of a gene are ideally used for multiple sequence alignments (MSAs) and the subsequent inference of their relationships. Frequently, however, MSAs contain a substantial amount of missing data. For example, expressed sequence tags (ESTs), which are partial sequences of expressed genes, are the predominant source of sequence data for many organisms. The patterns of missing data typical for EST-derived alignments greatly compromise the accuracy of estimated phylogenies.
Results: We present a statistical method for inferring phylogenetic trees from EST-based incomplete MSA data. We propose a class of hierarchical models for modeling pairwise distances between the sequences, and develop a fully Bayesian approach for estimation of the model parameters. Once the distance matrix is estimated, the phylogenetic tree may be constructed by applying neighbor-joining (or any other algorithm of choice). We also show that maximizing the marginal likelihood from the Bayesian approach yields similar results to a profile likelihood estimation. The proposed methods are illustrated using simulated protein families, for which the true phylogeny is known, and one real protein family.
Availability: R code for fitting these models is available from:
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2647833  PMID: 19147663
16.  Compensatory Evolution in RNA Secondary Structures Increases Substitution Rate Variation among Sites 
Molecular Biology and Evolution  2008;25(8):1778-1787.
There is growing evidence that interactions between biological molecules (e.g., RNA–RNA, protein–protein, RNA–protein) place limits on the rate and trajectory of molecular evolution. Here, by extending Kimura's model of compensatory evolution at interacting sites, we show that the ratio of transition to transversion substitutions (κ) at interacting sites should be equal to the square of the ratio at independent sites. Because transition mutations generally occur at a higher rate than transversions, the model predicts that κ should be higher at interacting sites than at independent sites. We tested this prediction in 10 RNA secondary structures by comparing phylogenetically derived estimates of κ in paired sites within stems (κp) and unpaired sites within loops (κu). Eight of the 10 structures showed an excellent match to the quantitative predictions of the model, and 9 of the 10 structures matched the qualitative prediction κp > κu. Only the Rev response element from the human immunodeficiency virus (HIV) genome showed the reverse pattern, with κp < κu. Although a variety of evolutionary forces could produce quantitative deviations from the model predictions, the reversal in magnitude of κp and κu could be achieved only by violating the model assumption that the underlying transition (or transversion) mutation rates were identical in paired and unpaired regions of the molecule. We explore the ability of the APOBEC3 enzymes, host defense mechanisms against retroviruses, which induce transition mutations preferentially in single-stranded regions of the HIV genome, to explain this exception to the rule. Taken as a whole, our findings suggest that κ may have utility as a simple diagnostic to evaluate proposed secondary structures.
PMCID: PMC2734131  PMID: 18535013
molecular evolution; RNA secondary structure; compensatory evolution; transition–transversion ratio
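The quantitative prediction stated in the abstract above, that κ at paired sites equals the square of κ at unpaired sites, can be written out as a small worked example. The function name and the input values are illustrative assumptions, not estimates from the study.

```python
def predicted_kappa_paired(kappa_unpaired):
    """Predicted transition/transversion ratio at paired (stem) sites.

    Implements only the relation stated in the abstract:
    kappa_p = kappa_u ** 2.
    """
    return kappa_unpaired ** 2


# Illustrative kappa_u values (not data from the paper).
for kappa_u in (1.0, 2.0, 3.0):
    print(kappa_u, predicted_kappa_paired(kappa_u))
```

Squaring makes explicit why the qualitative prediction κp > κu depends on transitions being more frequent than transversions: the square exceeds the original value only when κu is greater than 1.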
17.  Systematic Identification of Balanced Transposition Polymorphisms in Saccharomyces cerevisiae 
PLoS Genetics  2009;5(6):e1000502.
High-throughput techniques for detecting DNA polymorphisms generally do not identify changes in which the genomic position of a sequence, but not its copy number, varies among individuals. To explore such balanced structural polymorphisms, we used array-based Comparative Genomic Hybridization (aCGH) to conduct a genome-wide screen for single-copy genomic segments that occupy different genomic positions in the standard laboratory strain of Saccharomyces cerevisiae (S90) and a polymorphic wild isolate (Y101) through analysis of six tetrads from a cross of these two strains. Paired-end high-throughput sequencing of Y101 validated four of the predicted rearrangements. The transposed segments contained one to four annotated genes each, yet crosses between S90 and Y101 yielded mostly viable tetrads. The longest segment comprised 13.5 kb near the telomere of chromosome XV in the S288C reference strain and Southern blotting confirmed its predicted location on chromosome IX in Y101. Interestingly, inter-locus crossover events between copies of this segment occurred at a detectable rate. The presence of low-copy repetitive sequences at the junctions of this segment suggests that it may have arisen through ectopic recombination. Our methodology and findings provide a starting point for exploring the origins, phenotypic consequences, and evolutionary fate of this largely unexplored form of genomic polymorphism.
Author Summary
Balanced structural polymorphisms are differences in the relative arrangement of genomic features within species that do not affect DNA copy number. Little is known about their prevalence or importance because they are difficult to observe. Here, we present a novel methodology for systematically identifying such polymorphisms based on the idea that single-copy DNA that occupies different genomic locations in two parents will segregate independently during meiosis and will therefore reveal itself as a copy number difference among a fraction of progeny. Comparative hybridization reveals multiple balanced structural polymorphisms that involve changes to gene order in two strains of yeast; the results are independently validated using paired-end whole genome shotgun sequencing. The longest transposed segment we identify comprises 13.5 kb near the telomere of chromosome XV in the S288C reference strain and contains several annotated genes. We map the location of this polymorphism in the non-reference strain using genome-wide genotypic data, which also reveals an appreciable frequency of ectopic recombination among transposed segment pairs. The breakpoints of the remaining polymorphisms are localized by the paired-end sequence data. Our work provides proof-of-principle for a very general approach to systematically identify all balanced genomic polymorphisms in two different genotypes and is a starting point for understanding the frequency, evolutionary origins, and functional consequences of this seldom-studied class of genomic structural variation in eukaryotes.
PMCID: PMC2682701  PMID: 19503594
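The detection idea in the author summary above, that a single-copy segment located at different positions in the two parents shows up as a copy number difference in some progeny, can be checked by enumeration. This sketch assumes haploid progeny and two unlinked loci that segregate independently; the function name and the parent labels are hypothetical.

```python
from collections import Counter
from itertools import product


def spore_copy_number_distribution():
    """Expected copy-number distribution of the segment among haploid progeny.

    Parent "A" carries the segment at locus 1 only; parent "B" carries it at
    locus 2 only. Each progeny inherits each locus from either parent with
    equal probability, independently (assumption: the loci are unlinked).
    """
    counts = Counter()
    for locus1_parent, locus2_parent in product("AB", repeat=2):
        # The segment is present at locus 1 if that locus came from A,
        # and at locus 2 if that locus came from B.
        copies = (locus1_parent == "A") + (locus2_parent == "B")
        counts[copies] += 1
    total = sum(counts.values())
    return {copies: n / total for copies, n in sorted(counts.items())}


dist = spore_copy_number_distribution()
print(dist)
```

Under these assumptions half of the progeny carry either zero or two copies, which is the kind of copy number signal that comparative hybridization can detect even though neither parent differs in copy number.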
18.  Using ESTs for phylogenomics: Can one accurately infer a phylogenetic tree from a gappy alignment? 
While full genome sequences are still only available for a handful of taxa, large collections of partial gene sequences are available for many more. The alignment of partial gene sequences results in a multiple sequence alignment containing large gaps that are arranged in a staggered pattern. The consequences of this pattern of missing data on the accuracy of phylogenetic analysis are not well understood. We conducted a simulation study to determine the accuracy of phylogenetic trees obtained from gappy alignments using three commonly used phylogenetic reconstruction methods (Neighbor Joining, Maximum Parsimony, and Maximum Likelihood) and studied ways to improve the accuracy of trees obtained from such datasets.
We found that the pattern of gappiness in multiple sequence alignments derived from partial gene sequences substantially compromised phylogenetic accuracy even in the absence of alignment error. The decline in accuracy was beyond what would be expected based on the amount of missing data. The decline was particularly dramatic for Neighbor Joining and Maximum Parsimony, where the majority of gappy alignments contained 25% to 40% incorrect quartets. To improve the accuracy of the trees obtained from a gappy multiple sequence alignment, we examined two approaches. In the first approach, alignment masking, potentially problematic columns and input sequences are excluded from the dataset. Even in the absence of alignment error, masking improved phylogenetic accuracy up to 100-fold. However, masking retained, on average, only 83% of the input sequences. In the second approach, alignment subdivision, the missing data is statistically modelled in order to retain as many sequences as possible in the phylogenetic analysis. Subdivision resulted in more modest improvements to alignment accuracy, but succeeded in including almost all of the input sequences.
These results demonstrate that partial gene sequences and gappy multiple sequence alignments can pose a major problem for phylogenetic analysis. The concern will be greatest for high-throughput phylogenomic analyses, in which Neighbor Joining is often the preferred method due to its computational efficiency. Both approaches can be used to increase the accuracy of phylogenetic inference from a gappy alignment. The choice between the two approaches will depend upon how robust the application is to the loss of sequences from the input set, with alignment masking generally giving a much greater improvement in accuracy but at the cost of discarding a larger number of the input sequences.
PMCID: PMC2359737  PMID: 18366758
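The alignment masking idea described in entry 18 (excluding problematic columns and input sequences before tree building) can be illustrated with a minimal Python sketch. This is not the authors' procedure: the gap-fraction criterion, the function name `mask_alignment`, and both thresholds are assumptions chosen only for illustration.

```python
def mask_alignment(seqs, max_col_gap=0.5, max_seq_gap=0.5, gap="-"):
    """Illustrative masking of a gappy alignment (hypothetical criterion).

    seqs maps sequence names to aligned strings of equal length.
    Columns whose gap fraction exceeds max_col_gap are dropped first;
    sequences whose remaining gap fraction exceeds max_seq_gap are
    dropped afterwards.
    """
    names = list(seqs)
    if not names:
        return {}
    length = len(seqs[names[0]])
    if any(len(seqs[n]) != length for n in names):
        raise ValueError("all aligned sequences must have the same length")

    # Keep only columns that are not dominated by gaps.
    keep_cols = [
        i for i in range(length)
        if sum(seqs[n][i] == gap for n in names) / len(names) <= max_col_gap
    ]
    if not keep_cols:
        return {}

    masked = {n: "".join(seqs[n][i] for i in keep_cols) for n in names}

    # Drop sequences that are still mostly gaps after column masking.
    return {
        n: s for n, s in masked.items()
        if s.count(gap) / len(s) <= max_seq_gap
    }


# Toy staggered alignment, loosely resembling partial gene sequences.
example = {
    "a": "ACGT--",
    "b": "ACG---",
    "c": "----GT",
    "d": "ACGTGT",
}
result = mask_alignment(example, max_col_gap=0.4)
```

The sketch shows the trade-off the abstract reports: masking cleans up the alignment but discards some input sequences (here the sequence `c`, which no longer overlaps the retained columns).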
19.  The 2006 NESCent Phyloinformatics Hackathon: A Field Report 
In December, 2006, a group of 26 software developers from some of the most widely used life science programming toolkits and phylogenetic software projects converged on Durham, North Carolina, for a Phyloinformatics Hackathon, an intense five-day collaborative software coding event sponsored by the National Evolutionary Synthesis Center (NESCent). The goal was to help researchers to integrate multiple phylogenetic software tools into automated workflows. Participants addressed deficiencies in interoperability between programs by implementing “glue code” and improving support for phylogenetic data exchange standards (particularly NEXUS) across the toolkits. The work was guided by use-cases compiled in advance by both developers and users, and the code was documented as it was developed. The resulting software is freely available for both users and developers through incorporation into the distributions of several widely-used open-source toolkits. We explain the motivation for the hackathon, how it was organized, and discuss some of the outcomes and lessons learned. We conclude that hackathons are an effective mode of solving problems in software interoperability and usability, and are underutilized in scientific software development.
PMCID: PMC2684128
phylogenetics; phyloinformatics; open source software; analysis workflow
20.  Tracking the evolution of alternatively spliced exons within the Dscam family 
The Dscam gene in the fruit fly, Drosophila melanogaster, contains twenty-four exons, four of which are composed of tandem arrays that each undergo mutually exclusive alternative splicing (4, 6, 9 and 17), potentially generating 38,016 protein isoforms. This degree of transcript diversity has not been found in mammalian homologs of Dscam. We examined the molecular evolution of exons within this gene family to locate the point of divergence for this alternative splicing pattern.
Using the fruit fly Dscam exons 4, 6, 9 and 17 as seed sequences, we iteratively searched sixteen genomes for homologs, and then performed phylogenetic analyses of the resulting sequences to examine their evolutionary history. We found homologs in the nematode, arthropod and vertebrate genomes, including homologs in several vertebrates where Dscam had not been previously annotated. Among these, only the arthropods contain homologs arranged in tandem arrays indicative of mutually exclusive splicing. We found no homologs to these exons within the Arabidopsis, yeast, tunicate or sea urchin genomes but homologs to several constitutive exons from fly Dscam were present within tunicate and sea urchin. Comparing the rate of turnover within the tandem arrays of the insect taxa (fruit fly, mosquito and honeybee), we found the variants within exons 4 and 17 are well conserved in number and spatial arrangement despite 248–283 million years of divergence. In contrast, the variants within exons 6 and 9 have undergone considerable turnover since these taxa diverged, as indicated by deeply branching taxon-specific lineages.
Our results suggest that at least one Dscam exon array may be an ancient duplication that predates the divergence of deuterostomes from protostomes but that there is no evidence for the presence of arrays in the common ancestor of vertebrates. The different patterns of conservation and turnover among the Dscam exon arrays provide a striking example of how a gene can evolve in a modular fashion rather than as a single unit.
PMCID: PMC1397879  PMID: 16483367
21.  Phytome: a platform for plant comparative genomics 
Nucleic Acids Research  2005;34(Database issue):D724-D730.
Phytome is an online comparative genomics resource that can be applied to functional plant genomics, molecular breeding and evolutionary studies. It contains predicted protein sequences, protein family assignments, multiple sequence alignments, phylogenies and functional annotations for proteins from a large, phylogenetically diverse set of plant taxa. Phytome serves as a glue between disparate plant gene databases both by identifying the evolutionary relationships among orthologous and paralogous protein sequences from different species and by enabling cross-references between different versions of the same gene curated independently by different database groups. The web interface enables sophisticated queries on lineage-specific patterns of gene/protein family proliferation and loss. This rich dataset is serving as a platform for the unification of sequence-anchored comparative maps across taxonomic families of plants. The Phytome web interface can be accessed at the following URL: . Batch homology searches and bulk downloads are available upon free registration.
PMCID: PMC1347408  PMID: 16381967
22.  An international showcase of bioinformatics research 
Genome Biology  2003;4(9):337.
A report on the 11th International Conference on Intelligent Systems for Molecular Biology, Brisbane, Queensland, Australia, 29 June - 3 July 2003.
PMCID: PMC193652  PMID: 12952531
