As part of the development of the database Bgee (a dataBase for Gene Expression Evolution), we annotate and analyse expression data of different types and from different sources, notably Affymetrix data from GEO and ArrayExpress, and RNA-Seq data from SRA. During our quality control procedure, we have identified duplicated content in GEO and ArrayExpress, affecting ∼14% of our data: fully or partially duplicated experiments from independent data submissions, Affymetrix chips reused in several experiments, or reused within an experiment. We present here the procedure that we have established to filter such duplicates from Affymetrix data, and our procedure to identify future potential duplicates in RNA-Seq data.
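The abstract does not spell out the filtering procedure. As a minimal sketch of one plausible approach, chips with identical content could be flagged by comparing checksums of their probe-level intensities. The chip identifiers, the fixed rounding precision, and the function names below are illustrative assumptions, not the Bgee implementation.

```python
import hashlib
from collections import defaultdict


def chip_fingerprint(intensities):
    """Return a SHA-256 checksum of a chip's probe intensities (order-sensitive)."""
    digest = hashlib.sha256()
    for value in intensities:
        # Fixed-precision formatting (an assumption) so that equal values
        # always serialize to the same bytes.
        digest.update(f"{value:.4f};".encode("ascii"))
    return digest.hexdigest()


def find_duplicate_chips(chips):
    """Group chip identifiers that share the same fingerprint.

    `chips` maps a hypothetical chip identifier to its list of intensities.
    Returns a sorted list of identifier groups with more than one member.
    """
    groups = defaultdict(list)
    for chip_id, intensities in chips.items():
        groups[chip_fingerprint(intensities)].append(chip_id)
    return sorted(sorted(ids) for ids in groups.values() if len(ids) > 1)


# Toy data: the same chip content submitted under two different experiments.
chips = {
    "expA_chip1": [10.5, 200.0, 33.25],
    "expB_chip7": [10.5, 200.0, 33.25],
    "expA_chip2": [11.0, 180.0, 40.0],
}
duplicates = find_duplicate_chips(chips)
print(duplicates)
```

An exact checksum only catches byte-identical content; chips that were re-normalized or partially modified before resubmission would need a similarity-based comparison instead.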
Harnessing community intelligence in knowledge curation bears significant promise in dealing with communication and education in the flood of scientific knowledge. As knowledge is accumulated at ever-faster rates, scientific nomenclature, a particular kind of knowledge, is concurrently generated in all kinds of fields. Since nomenclature is a system of terms used to name things in a particular discipline, accurate translation of scientific nomenclature into different languages is of critical importance, not only for communications and collaborations with English-speaking people, but also for knowledge dissemination among people in the non-English-speaking world, particularly young students and researchers. However, the translation of scientific nomenclature from English to other languages lacks accuracy and standardization, especially for languages that do not belong to the same language family as English. To address this issue, here we propose for the first time the application of community intelligence in scientific nomenclature management, namely, harnessing collective intelligence for translation of scientific nomenclature from English to other languages. As community intelligence applied to knowledge curation is primarily aided by wiki and Chinese is the native language for about one-fifth of the world’s population, we put the proposed application into practice by developing a wiki-based English-to-Chinese Scientific Nomenclature Dictionary (ESND; http://esnd.big.ac.cn). ESND is a wiki-based, publicly editable and open-content platform, exploiting the full power of the scientific community in collectively and collaboratively managing scientific nomenclature. Based on community curation, ESND is capable of achieving accurate, standard, and comprehensive scientific nomenclature, demonstrating a valuable application of community intelligence in knowledge curation.
Residue-residue interactions that fold a protein into a unique three-dimensional structure and enable it to perform a specific function impose structural and functional constraints in varying degrees on each residue site. Selective constraints on residue sites are recorded in amino acid orders in homologous sequences and also in the evolutionary trace of amino acid substitutions. A challenge is to extract direct dependences between residue sites by removing phylogenetic correlations and indirect dependences through other residues within a protein or even through other molecules. Rapid growth of protein families with unknown folds requires an accurate de novo prediction method for protein structure. Recent attempts to disentangle direct from indirect dependences of amino acid types between residue positions in multiple sequence alignments have revealed that inferred residue-residue proximities can be sufficient information to predict a protein fold without the use of known three-dimensional structures. Here, we propose an alternative method of inferring coevolving site pairs from concurrent and compensatory substitutions between sites in each branch of a phylogenetic tree. Substitution probability and physico-chemical changes (volume, charge, hydrogen-bonding capability, and others) accompanied by substitutions at each site in each branch of a phylogenetic tree are estimated with the likelihood of each substitution, and their direct correlations between sites are used to detect concurrent and compensatory substitutions. In order to extract direct dependences between sites, partial correlation coefficients of the characteristic changes along branches between sites, in which linear multiple dependences on feature vectors at other sites are removed, are calculated and used to rank coevolving site pairs.
Accuracy of contact prediction based on the present coevolution score is comparable to that achieved by a maximum entropy model of protein sequences for 15 protein families taken from the Pfam release 26.0. Besides, this excellent accuracy indicates that compensatory substitutions are significant in protein evolution.
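To illustrate why partial correlation helps separate direct from indirect dependences, the following minimal sketch computes a first-order partial correlation between two sites while controlling for a single third site. This is a simplification: the method described above removes linear dependences on all other sites at once, and the per-branch feature values used here are made-up toy numbers.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def partial_correlation(x, y, z):
    """Correlation of x and y after removing their linear dependence on z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Toy per-branch changes at three sites: sites X and Y each follow site Z
# plus independent variation, so X and Y are only indirectly coupled.
site_z = [-1.0, -1.0, 1.0, 1.0]
site_x = [0.0, -2.0, 0.0, 2.0]   # site_z + [1, -1, -1, 1]
site_y = [0.0, -2.0, 2.0, 0.0]   # site_z + [1, -1, 1, -1]

raw = pearson(site_x, site_y)
direct = partial_correlation(site_x, site_y, site_z)
print(f"raw correlation: {raw:.3f}, partial correlation: {direct:.3f}")
```

In this toy case the raw correlation between X and Y is 0.5, but it vanishes once the shared dependence on Z is removed, which is the kind of indirect coupling a coevolution score should not rank highly.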
The skeleton is of fundamental importance in research in comparative vertebrate morphology, paleontology, biomechanics, developmental biology, and systematics. Motivated by research questions that require computational access to and comparative reasoning across the diverse skeletal phenotypes of vertebrates, we developed a module of anatomical concepts for the skeletal system, the Vertebrate Skeletal Anatomy Ontology (VSAO), to accommodate and unify the existing skeletal terminologies for the species-specific (mouse, the frog Xenopus, zebrafish) and multispecies (teleost, amphibian) vertebrate anatomy ontologies. Previous differences between these terminologies prevented even simple queries across databases pertaining to vertebrate morphology. This module of upper-level and specific skeletal terms currently includes 223 defined terms and 179 synonyms that integrate skeletal cells, tissues, biological processes, organs (skeletal elements such as bones and cartilages), and subdivisions of the skeletal system. The VSAO is designed to integrate with other ontologies, including the Common Anatomy Reference Ontology (CARO), Gene Ontology (GO), Uberon, and Cell Ontology (CL), and it is freely available to the community to be updated with additional terms required for research. Its structure accommodates anatomical variation among vertebrate species in development, structure, and composition. Annotation of diverse vertebrate phenotypes with this ontology will enable novel inquiries across the full spectrum of phenotypic diversity.
Vertebrate interferon-induced transmembrane (IFITM) genes have been demonstrated to have extensive and diverse functions, playing important roles in the evolution of vertebrates. Although their functions have been observed, the evolutionary dynamics of this gene family are complex and remain unknown. Here, we performed detailed evolutionary analyses to unravel the evolutionary history of the vertebrate IFITM family. A total of 174 IFITM orthologous genes and 112 pseudogenes were identified from 27 vertebrate genome sequences. The vertebrate IFITM family can be divided into immunity-related IFITM (IR-IFITM), IFITM5 and IFITM10 sub-families in phylogeny, implying origins from three different progenitors. In general, vertebrate IFITM genes are located in two loci, one containing the IFITM10 gene, and the other locus containing IFITM5 and various numbers of IR-IFITM genes. Conservation of evolutionary synteny was observed in these IFITM genes. Significant functional divergence was detected among the three IFITM sub-families. No gene duplication or positive selection was found in the IFITM5 sub-family, implying the functional conservation of IFITM5, which is involved in bone formation, in vertebrate evolution. No IFITM5 locus was identified in the marmoset genome, suggesting a potential association with the tiny size of this monkey. The IFITM10 sub-family was divided into two groups: aquatic and terrestrial types. Functional divergence was detected between the two groups, and five IFITM10-like genes from frog were dispersed into the two groups. Both gene duplication and positive selection were observed in aquatic vertebrate IFITM10-like genes, indicating that IFITM10 might be associated with the adaptation to aquatic environments. A large number of lineage- and species-specific gene duplications were observed in the IR-IFITM sub-family, and positive selection was detected in IR-IFITM of primates and rodents.
Because primates have experienced a long history of viral infection, such rapid expansion and positive selection suggests that the evolution of primate IR-IFITM genes is associated with broad-spectrum antiviral activity.
The regulatory mechanisms that determine which genes are specifically expressed in which tissues are still not fully elucidated, especially in plants. Using internal correspondence analysis, I first establish that tissue-specific genes exhibit significantly different synonymous codon usage in rice, although this effect is weak. The variability of synonymous codon usage between tissues accounts for 5.62% of the total codon usage variability, and has mainly arisen from neutral evolutionary forces, such as GC content variation among tissues. Moreover, tissue-specific genes are under differential selective constraints, implying that natural selection also contributes to the codon usage divergence between tissues. These findings may add further evidence for understanding the differentiation and regulation of tissue-specific gene products in plants.
The 5th International Biocuration Conference brought together over 300 scientists to exchange ideas about their work and to discuss issues relevant to the International Society for Biocuration’s (ISB) mission. Recurring themes this year included the creation and promotion of gold standards, the need for more ontologies, and more formal interactions with journals. The conference is an essential part of the ISB's goal to support exchanges among members of the biocuration community. Next year's conference will be held in Cambridge, UK, from 7 to 10 April 2013. In the meantime, the ISB website provides information about the society's activities (http://biocurator.org), as well as related events of interest.
Amelogenin, the major enamel matrix protein in tooth development, has been demonstrated to play a significant role in tooth enamel formation. Previous studies have identified the alternative splicing of amelogenin in many mammalian vertebrates as one mechanism for heterogeneous amelogenin expression in teeth. While amelogenin and its splicing forms in mammalian vertebrates have been cloned and sequenced, the amelogenin gene, especially its splicing forms in non-mammalian species, remains largely unknown. To better understand the mechanism underlying amelogenin evolution, we previously cloned and characterized an amelogenin gene sequence from a squamate, the green iguana. In this study, we employed RT-PCR to amplify the amelogenin gene from teeth of the black spiny-tailed iguana Ctenosaura similis, and discovered a novel splicing form of the amelogenin gene. The transcript of the newly identified iguana amelogenin gene (named C. Similis-T2L) is 873 nucleotides long, encoding an expected polypeptide of 206 amino acids. The C. Similis-T2L contains a unique exon, designated exon X, which is located between exon 5 and exon 6. The C. Similis-T2L contains 7 exons: exons 1, 2, 3, 5, X, 6, and 7. Analysis of the secondary and tertiary structures of T2L amelogenin protein demonstrated that exon X has a dramatic effect on the amelogenin structures. This is the first report to provide definitive evidence for the amelogenin alternative splicing in non-mammalian vertebrates, revealing a unique exon X and the splicing form of the amelogenin gene transcript in Ctenosaura similis.
Salmonella Paratyphi A (S. Paratyphi A) is a highly adapted, human-specific pathogen that causes paratyphoid fever. Cases of paratyphoid fever have recently been increasing, and the disease is becoming a major public health concern, especially in Eastern and Southern Asia. To investigate the genomic variation and evolution of S. Paratyphi A, a pan-genomic analysis was performed on five newly sequenced S. Paratyphi A strains and two other reference strains. A whole genome comparison revealed that the seven genomes are collinear and that their organization is highly conserved. The high rate of substitutions in part of the core genome indicates that there are frequent homologous recombination events. Based on the changes in the pan-genome size and cluster number (both in the core functional genes and core pseudogenes), it can be inferred that the sharply increasing number of pseudogene clusters may be strongly correlated with the inactivation of functional genes, indicating that the S. Paratyphi A genome is being degraded.
Insects are the most diverse group of animals on the planet, comprising over 90% of all metazoan life forms, and have adapted to a wide diversity of ecosystems in nearly all environments. They have evolved highly sensitive chemical senses that are central to their interaction with their environment and to communication between individuals. Understanding the molecular bases of insect olfaction is therefore of great importance from both a basic and applied perspective. Odorant binding proteins (OBPs) are some of the most abundant proteins found in insect olfactory organs, where they are the first component of the olfactory transduction cascade, carrying odorant molecules to the olfactory receptors. We carried out a search for OBPs in the genome of the parasitoid wasp Nasonia vitripennis and identified 90 sequences encoding putative OBPs. This is the largest OBP family so far reported in insects. We report unique features of the N. vitripennis OBPs, including the presence and evolutionary origin of a new subfamily of double-domain OBPs (consisting of two concatenated OBP domains), the loss of conserved cysteine residues and the expression of pseudogenes. This study also demonstrates the extremely dynamic evolution of the insect OBP family: (i) the number of different OBPs can vary greatly between species; (ii) the sequences are highly diverse, sometimes as a result of positive selection pressure, with even the canonical cysteines being lost; (iii) new lineage-specific domain arrangements can arise, such as the double-domain OBP subfamily of wasps and mosquitoes.
Gene duplication is a source of molecular innovation throughout evolution. However, even with massive amounts of genome sequence data, correlating gene duplication with speciation and other events in natural history can be difficult. This is especially true in its most interesting cases, where rapid and multiple duplications are likely to reflect adaptation to rapidly changing environments and life styles. This may be so for Class I of alcohol dehydrogenases (ADH1s), where multiple duplications occurred in primate lineages in Old and New World monkeys (OWMs and NWMs) and hominoids.
To build a preferred model for the natural history of ADH1s, we determined the sequences of nine new ADH1 genes, finding for the first time multiple paralogs in various prosimians (lemurs, strepsirhines). Database mining then identified novel ADH1 paralogs in both macaque (an OWM) and marmoset (a NWM). These were used with the previously identified human paralogs to resolve controversies relating to dates of duplication and gene conversion in the ADH1 family. Central to these controversies are differences in the topologies of trees generated from exonic (coding) sequences and intronic sequences.
We provide evidence that gene conversions are the primary source of difference, using molecular clock dating of duplications and analyses of microinsertions and deletions (micro-indels). The tree topology inferred from intron sequences appears to represent the natural history of ADH1s more correctly, with the ADH1 paralogs in platyrrhines (NWMs) and catarrhines (OWMs and hominoids) having arisen by duplications shortly predating the divergence of OWMs and NWMs. We also conclude that paralogs in lemurs arose independently. Finally, we identify errors in database interpretation as the source of controversies concerning gene conversion. These analyses provide a model for the natural history of ADH1s that posits four ADH1 paralogs in the ancestor of Catarrhine and Platyrrhine primates, followed by the loss of an ADH1 paralog in the human lineage.
Paralemmin-1 is a protein implicated in plasma membrane dynamics, the development of filopodia, neurites and dendritic spines, as well as the invasiveness and metastatic potential of cancer cells. However, little is known about its mode of action, or about the biological functions of the other paralemmin isoforms: paralemmin-2, paralemmin-3 and palmdelphin. We describe here evolutionary analyses of the paralemmin gene family in a broad range of vertebrate species. Our results suggest that the four paralemmin isoform genes (PALM1, PALM2, PALM3 and PALMD) arose by quadruplication of an ancestral gene in the two early vertebrate genome duplications. Paralemmin-1 and palmdelphin were further duplicated in the teleost fish specific genome duplication. We identified a unique sequence motif common to all paralemmins, consisting of 11 highly conserved residues of which four are invariant. A single full-length paralemmin homolog with this motif was identified in the genome of the sea lamprey Petromyzon marinus and an isolated putative paralemmin motif could be detected in the genome of the lancelet Branchiostoma floridae. This allows us to conclude that the paralemmin gene family arose early and has been maintained throughout vertebrate evolution, suggesting functional diversification and specific biological roles of the paralemmin isoforms. The paralemmin genes have also maintained specific features of gene organisation and sequence. This includes the occurrence of closely linked downstream genes, initially identified as a readthrough fusion protein with mammalian paralemmin-2 (Palm2-AKAP2). We have found evidence for such an arrangement for paralemmin-1 and -2 in several vertebrate genomes, as well as for palmdelphin and paralemmin-3 in teleost fish genomes, and suggest the name paralemmin downstream genes (PDG) for this new gene family. Thus, our findings point to ancient roles for paralemmins and distinct biological functions of the gene duplicates.
Silk spinning is essential to spider ecology and has had a key role in the expansive diversification of spiders. Silk is composed primarily of proteins called spidroins, which are encoded by a multi-gene family. Spidroins have been studied extensively in the derived clade, Orbiculariae (orb-weavers), from the suborder Araneomorphae (‘true spiders’). Orbicularians produce a suite of different silks, and underlying this repertoire is a history of duplication and spidroin gene divergence. A second class of silk proteins, Egg Case Proteins (ECPs), is known only from the orbicularian species, Latrodectus hesperus (Western black widow). In L. hesperus, ECPs bond with tubuliform spidroins to form egg case silk fibers. Because most of the phylogenetic diversity of spiders has not been sampled for their silk genes, there is limited understanding of spidroin gene family history and the prevalence of ECPs. Silk genes have not been reported from the suborder Mesothelae (segmented spiders), which diverged from all other spiders >380 million years ago, and sampling from Mygalomorphae (tarantulas, trapdoor spiders) and basal araneomorph lineages is sparse. In comparison to orbicularians, mesotheles and mygalomorphs have a simpler silk biology and thus are hypothesized to have less diversity of silk genes. Here, we present cDNAs synthesized from the silk glands of six mygalomorph species, a mesothele, and a non-orbicularian araneomorph, and uncover a surprisingly rich silk gene diversity. In particular, we find ECP homologs in the mesothele, suggesting that ECPs were present in the common ancestor of extant spiders, and originally were not specialized to complex with tubuliform spidroins. Furthermore, gene-tree/species-tree reconciliation analysis reveals that numerous spidroin gene duplications occurred after the split between Mesothelae and Opisthothelae (Mygalomorphae plus Araneomorphae).
We use the spidroin gene tree to reconstruct the evolution of amino acid compositions of spidroins that perform different ecological functions.
The third complement component (C3) is a central protein of the complement system, conserved from fish to mammals. It also shows distinct characteristics in different animal groups. Striking features of the fish complement system have been unveiled, including prominent levels of extrahepatic expression and isotypic diversity of the complement components. Evidence that the complement system enhances B and T cell responses in mammals indicates that it also serves as a bridge between the innate and adaptive responses. For these reasons, it is interesting to explore the evolutionary process of C3 genes and to investigate whether the large differences between aquatic and terrestrial environments affected C3 evolution in fish and mammals.
Analysis revealed that these two groups of animals experienced different evolutionary patterns. The mammalian C3 genes were under purifying selection, while positive selection was detected in fish C3 genes. Three periods of positive selection on C3 genes were also detected: two occurred on the ancestral lineages leading to all vertebrates and to mammals, respectively, and one occurred early in fish evolutionary history.
Three periods of positive selection occurred on C3 genes during their history, and fish and mammalian C3 genes experienced different evolutionary patterns owing to their distinct living environments.
The function of most proteins is not determined experimentally, but is extrapolated from homologs. According to the “ortholog conjecture”, or standard model of phylogenomics, protein function changes rapidly after duplication, leading to paralogs with different functions, while orthologs retain the ancestral function. We report here that a comparison of experimentally supported functional annotations among homologs from 13 genomes mostly supports this model. We show that to analyze GO annotation effectively, several confounding factors need to be controlled: authorship bias, variation of GO term frequency among species, variation of background similarity among species pairs, and propagated annotation bias. After controlling for these biases, we observe that orthologs have generally more similar functional annotations than paralogs. This is especially strong for sub-cellular localization. We observe only a weak decrease in functional similarity with increasing sequence divergence. These findings hold over a large diversity of species; notably orthologs from model organisms such as E. coli, yeast or mouse have conserved function with human proteins.
To infer the function of an unknown gene, possibly the most effective way is to identify a well-characterized evolutionarily related gene, and assume that they have both kept their ancestral function. If several such homologs are available, all else being equal, it has long been assumed that those that diverged by speciation (“orthologs”) are functionally closer than those that diverged by duplication (“paralogs”); thus function is more reliably inferred from the former. But despite its prevalence, this model mostly rests on first principles, as for the longest time we have not had sufficient data to test it empirically. Recently, some studies began investigating this question and have cast doubt on the validity of this model. Here, we show that by considering a wide range of organisms and data, and, crucially, by correcting for several easily overlooked biases affecting functional annotations, the standard model is corroborated by the presently available experimental data.
Motivation: Comparative analyses of gene expression data from different species have become an important component of the study of molecular evolution. Thus methods are needed to estimate evolutionary distances between expression profiles, as well as a neutral reference to estimate selective pressure. Divergence between expression profiles of homologous genes is often calculated with Pearson's or Euclidean distance. Neutral divergence is usually inferred from randomized data. Despite being widely used, neither of these two steps has been well studied. Here, we analyze these methods formally and on real data, highlight their limitations and propose improvements.
Results: It has been demonstrated that Pearson's distance, in contrast to Euclidean distance, leads to underestimation of the expression similarity between homologous genes with a conserved uniform pattern of expression. Here, we first extend this study to genes with conserved, but specific pattern of expression. Surprisingly, we find that both Pearson's and Euclidean distances used as a measure of expression similarity between genes depend on the expression specificity of those genes. We also show that the Euclidean distance depends strongly on data normalization. Next, we show that the randomization procedure that is widely used to estimate the rate of neutral evolution is biased when broadly expressed genes are abundant in the data. To overcome this problem, we propose a novel randomization procedure that is unbiased with respect to expression profiles present in the datasets. Applying our method to the mouse and human gene expression data suggests significant gene expression conservation between these species.
Supplementary data are available at Bioinformatics online.
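The contrast between the two distance measures can be made concrete with a small sketch. It assumes that Pearson's distance is defined as one minus the Pearson correlation coefficient, a common convention that the abstract does not state explicitly, and it uses made-up expression values for four tissues.

```python
import math


def pearson_distance(x, y):
    """1 - Pearson correlation (assumed definition of Pearson's distance)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return 1.0 - cov / math.sqrt(var_x * var_y)


def euclidean_distance(x, y):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Conserved uniform expression: both orthologs are expressed at about the
# same level everywhere, and only small noise differs between species.
uniform_human = [10.0, 10.1, 9.9, 10.0]
uniform_mouse = [10.0, 9.9, 10.1, 10.0]

# Conserved specific expression: both orthologs peak in the third tissue.
specific_human = [1.0, 1.0, 10.0, 1.0]
specific_mouse = [1.0, 1.0, 9.0, 1.0]

for label, h, m in [("uniform", uniform_human, uniform_mouse),
                    ("specific", specific_human, specific_mouse)]:
    print(label,
          f"Pearson distance = {pearson_distance(h, m):.3f}",
          f"Euclidean distance = {euclidean_distance(h, m):.3f}")
```

With these toy values, the uniformly expressed pair gets the maximal Pearson distance because only the noise is correlated, although the profiles are nearly identical, whereas the tissue-specific pair gets a Pearson distance of zero but a larger Euclidean distance. Both measures therefore depend on expression specificity.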
The repeated origin of similar phenotypes is invaluable for studying the underlying genetics of adaptive traits; molecular evidence, however, is lacking for most examples of such similarity. The floral morphology of neotropical Malpighiaceae is distinctive and highly conserved, especially with regard to symmetry, and is thought to result from specialization on oil-bee pollinators. We recently demonstrated that CYCLOIDEA2–like genes (CYC2A and CYC2B) are associated with the development of the stereotypical floral zygomorphy that is critical to this plant–pollinator mutualism. Here, we build on this developmental framework to characterize floral symmetry in three clades of Malpighiaceae that have independently lost their oil bee association and experienced parallel shifts in their floral morphology, especially in regard to symmetry. We show that in each case these species exhibit a loss of CYC2B function, and a strikingly similar shift in the expression of CYC2A that is coincident with their shift in floral symmetry. These results indicate that similar floral phenotypes in this large angiosperm clade have evolved via parallel genetic changes from an otherwise highly conserved developmental program.
Fetal chylothorax (FC) is a rare condition characterized by lymphocyte-rich pleural effusion. Although its pathogenesis remains elusive, it may involve inflammation, since there are increased concentrations of proinflammatory mediators in pleural fluids. Only a few hereditary lymphedema-associated gene loci, e.g. VEGFR3, ITGA9 and PTPN11, were detected in human fetuses with this condition; these cases had a poorer prognosis, due to defective lymphangiogenesis. In the present study, genome-wide gene expression analysis was conducted, comparing pleural and ascitic fluids in three hydropic fetuses, one with and two without the ITGA9 mutation. One fetus (the index case), from a dizygotic pregnancy (the cotwin was unaffected), received antenatal OK-432 pleurodesis and survived beyond the neonatal stage, despite having the ITGA9 mutation. Genes and pathways involved in the immune response were universally up-regulated in fetal pleural fluids compared to those in ascitic fluids. Furthermore, genes involved in the lymphangiogenesis pathway were down-regulated in fetal pleural fluids (compared to ascitic fluid), but following OK-432 pleurodesis, they were up-regulated. Expression of ITGA9 was concordant with overall trends of lymphangiogenesis. In conclusion, we inferred that both the immune response and lymphangiogenesis were implicated in the pathogenesis of fetal chylothorax. Furthermore, genome-wide gene expression microarray analysis may facilitate personalized medicine by selecting the most appropriate treatment, according to the specific circumstances of the patient, for this rare, but heterogeneous disease.
The degree of conservation of gene expression between homologous organs largely remains an open question. Several recent studies reported some evidence in favor of such conservation. However, most studies compute organ similarity across all orthologous genes, whereas the expression levels of many genes are not informative about organ specificity.
Here, we use a modularization algorithm to overcome this limitation through the identification of inter-species co-modules of organs and genes. We identify such co-modules using mouse and human microarray expression data. They are functionally coherent both in terms of genes and of organs from both organisms. We show that a large proportion of genes belonging to the same co-module are orthologous between mouse and human. Moreover, their zebrafish orthologs also tend to be expressed in the corresponding homologous organs. Notable exceptions to the general pattern of conservation are the testis and the olfactory bulb. Interestingly, some co-modules consist of single organs, while others combine several functionally related organs. For instance, amygdala, cerebral cortex, hypothalamus and spinal cord form a clearly discernible unit of expression, both in mouse and human.
Our study provides a new framework for comparative analysis which will be applicable also to other sets of large-scale phenotypic data collected across different species.
MicroRNAs (miRNAs) constitute an important class of gene regulators. While models have been proposed to explain their appearance and expansion, the validation of these models has been difficult due to the lack of comparative studies. Here, we analyze miRNA evolutionary patterns in two mammals, human and mouse, in relation to the age of miRNA families. In this comparative framework, we confirm some predictions of previously advanced models of miRNA evolution, e.g. that miRNAs arise more frequently de novo than by duplication, or that the number of protein-coding genes targeted by miRNAs decreases with evolutionary time. We also corroborate that miRNAs display an increase in expression level with evolutionary time; however, we show that this relation is largely tissue-dependent, and especially low in embryonic or nervous tissues. We identify a bias of tag-sequencing techniques regarding the assessment of breadth of expression, leading us, contrary to predictions, to find more tissue-specific expression of older miRNAs. Together, our results refine the models used so far to depict the evolution of miRNA genes. They underline the role of tissue-specific selective forces on the evolution of miRNAs, as well as the potential co-evolution patterns between miRNAs and the protein-coding genes they target.
One of the main motivations to study amphioxus is its potential for understanding the last common ancestor of chordates, which notably gave rise to the vertebrates. An important feature in this respect is the slow evolutionary rate that seems to have characterized the cephalochordate lineage, making amphioxus an interesting proxy for the chordate ancestor, as well as a key lineage to include in comparative studies. Whereas slow evolution was first noticed at the phenotypic level, it has also been described at the genomic level. Here, we examine whether the amphioxus genome is indeed a good proxy for the genome of the chordate ancestor, with a focus on protein-coding genes. We investigate genome features, such as synteny, gene duplication and gene loss, and contrast the amphioxus genome with those of other deuterostomes that are used in comparative studies, such as Ciona, Oikopleura and urchin.
deuterostomes; evolutionary rates; gene duplication; gene loss; orthology; synteny
Motivation: Most anatomical ontologies are species-specific, whereas a framework for comparative studies is needed. We describe the vertebrate Homologous Organs Groups ontology, vHOG, used to compare expression patterns between species.
Results: vHOG is a multispecies anatomical ontology for the vertebrate lineage. It is based on the HOGs used in the Bgee database of gene expression evolution. vHOG version 1.4 includes 1184 terms, follows OBO principles and is based on the Common Anatomy Reference Ontology (CARO). vHOG only describes structures with historical homology relations between model vertebrate species. The mapping to species-specific anatomical ontologies is provided as a separate file, so that no homology hypothesis is stated within the ontology itself. Each mapping has been manually reviewed, and we provide support codes and references when available.
Availability and implementation: vHOG is available from the Bgee download site (http://bgee.unil.ch/), as well as from the OBO Foundry and the NCBO Bioportal websites.
Many studies have been published outlining the global effects of 17β-estradiol (E2) on gene expression in human epithelial breast cancer-derived MCF-7 cells. These studies show large variation in results, reporting between ~100 and ~1500 genes regulated by E2, with poor overlap.
We performed a meta-analysis of these expression studies, using the Rank product method to obtain a more accurate and stable list of the differentially expressed genes, and of pathways regulated by E2. We analyzed 9 time-series data sets, concentrating on response at 3-4 hrs (early) and at 24 hrs (late). We found >1000 statistically significant probe sets after correction for multiple testing at 3-4 hrs, and >2000 significant probe sets at 24 hrs. Differentially expressed genes were examined by pathway analysis. This revealed 15 early response pathways, mostly related to cell signaling and proliferation, and 20 late response pathways, mostly related to breast cancer, cell division, DNA repair and recombination.
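As an illustration of the rank product idea mentioned above, the following minimal Python sketch ranks genes within each data set by fold change and combines the ranks by their geometric mean. It is not the pipeline used in the study: the function name `rank_products`, the input layout and the toy values are assumptions made for this example, and the permutation step normally used to assess significance is omitted.

```python
import math


def rank_products(fold_changes):
    """Compute the rank product of each gene across data sets.

    fold_changes: dict mapping gene -> list of log fold changes,
    one value per data set (hypothetical input layout).
    Returns a dict mapping gene -> geometric mean of its ranks,
    where rank 1 is the most up-regulated gene in a data set.
    """
    genes = list(fold_changes)
    n_sets = len(fold_changes[genes[0]])
    log_sum = {gene: 0.0 for gene in genes}
    for i in range(n_sets):
        # Sort genes from largest to smallest fold change in data set i.
        ordered = sorted(genes, key=lambda g: fold_changes[g][i], reverse=True)
        for rank, gene in enumerate(ordered, start=1):
            # Summing logs avoids overflow when many ranks are multiplied.
            log_sum[gene] += math.log(rank)
    return {gene: math.exp(log_sum[gene] / n_sets) for gene in genes}


# Toy values for three hypothetical genes in three data sets.
demo = {
    "geneA": [2.0, 1.5, 1.8],
    "geneB": [0.1, 1.6, 0.2],
    "geneC": [-1.0, -0.5, 0.0],
}
rp = rank_products(demo)
for gene in sorted(rp, key=rp.get):
    print(gene, round(rp[gene], 3))
```

A small rank product means a gene is consistently near the top of the list in every data set, which is why the statistic is less sensitive to a single outlying study than an average of fold changes would be.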
Our results confirm that meta-analysis identified more differentially expressed genes than the individual studies, and that these genes act together in networks. These results provide new insight into E2-regulated mechanisms, especially in the context of breast cancer.
microarray; meta-analysis; estrogen; breast cancer; pathways
Phylogenomic databases provide orthology predictions for species with fully sequenced genomes. Although the goal seems well-defined, the content of these databases differs greatly. Seven ortholog databases (Ensembl Compara, eggNOG, HOGENOM, InParanoid, OMA, OrthoDB, Panther) were compared on the basis of reference trees. For three well-conserved protein families, we observed a generally high specificity of orthology assignments for these databases. We show that differences in the completeness of predicted gene relationships and in the phylogenetic information are, for the great majority, not due to the methods used, but to differences in the underlying database concepts. According to our metrics, none of the databases provides a fully correct and comprehensive protein classification. Our results provide a framework for meaningful and systematic comparisons of phylogenomic databases. In the future, a sustainable set of ‘Gold standard’ phylogenetic trees could provide a robust method for phylogenomic databases to assess their current quality status, measure changes following new database releases and diagnose improvements subsequent to an upgrade of the analysis procedure.
conceptual comparison; phylogenomic databases; quality assessment; reference gene trees
Despite the common assumption that orthologs usually share the same function, there have been various reports of divergence between orthologs, even among species as close as mammals. The comparison of mouse and human is of special interest, because mouse is often used as a model organism to understand human biology. We review the literature on evidence for divergence between human and mouse orthologous genes, and discuss it in the context of biomedical research.
orthology; expression divergence; alternative splicing; copy number variants; phenotypic divergence