The Model Organism Protein Expression Database (MOPED, http://moped.proteinspire.org) is an expanding proteomics resource to enable biological and biomedical discoveries. MOPED aggregates simple, standardized and consistently processed summaries of protein expression and metadata from proteomics (mass spectrometry) experiments from human and model organisms (mouse, worm and yeast). The latest version of MOPED adds new estimates of protein abundance and concentration, as well as relative (differential) expression data. MOPED provides a new updated query interface that allows users to explore information by organism, tissue, localization, condition, experiment, or keyword. MOPED supports the Human Proteome Project’s efforts to generate chromosome- and disease-specific proteomes by providing links from proteins to chromosome and disease information, as well as many complementary resources. MOPED supports a new omics metadata checklist in order to harmonize data integration, analysis and use. MOPED’s development is driven by the user community, which spans 90 countries and guides future development that will transform MOPED into a multi-omics resource. MOPED encourages users to submit data in a simple format. They can use the metadata checklist to generate a data publication for this submission. As a result, MOPED will provide even greater insights into complex biological processes and systems and enable deeper and more comprehensive biological and biomedical discoveries.
Biological processes are fundamentally driven by complex interactions between biomolecules. Integrated high-throughput omics studies enable multifaceted views of cells, organisms, or their communities. With the advent of new post-genomics technologies, omics studies are becoming increasingly prevalent; yet the full impact of these studies can only be realized through data harmonization, sharing, meta-analysis, and integrated research. These essential steps require consistent generation, capture, and distribution of metadata. To ensure transparency, facilitate data harmonization, and maximize reproducibility and usability of life sciences studies, we propose a simple common omics metadata checklist. The proposed checklist is built on the rich ontologies and standards already in use by the life sciences community. The checklist will serve as a common denominator to guide experimental design, capture important parameters, and be used as a standard format for stand-alone data publications. The omics metadata checklist and data publications will create efficient linkages between omics data and knowledge-based life sciences innovation and, importantly, allow for appropriate attribution to data generators and infrastructure science builders in the post-genomics era. We ask that the life sciences community test the proposed omics metadata checklist and data publications and provide feedback for their use and improvement.
We analyzed four families that presented with a similar condition characterized by congenital microcephaly, intellectual disability, progressive cerebral atrophy and intractable seizures. We show that recessive mutations in the ASNS gene are responsible for this syndrome. Two of the identified missense mutations dramatically reduce ASNS protein abundance, suggesting that the mutations cause loss of function. Hypomorphic Asns mutant mice have structural brain abnormalities, including enlarged ventricles and reduced cortical thickness, and show deficits in learning and memory mimicking aspects of the patient phenotype. ASNS encodes asparagine synthetase, which catalyzes the synthesis of asparagine from glutamine and aspartate. The neurological impairment resulting from ASNS deficiency may be explained by asparagine depletion in the brain, or by accumulation of aspartate/glutamate leading to enhanced excitability and neuronal damage. Our study thus indicates that asparagine synthesis is essential for the development and function of the brain but not for that of other organs.
Autophagy dysfunction has been implicated in a group of progressive neurodegenerative diseases, and has been reported to play a major role in the pathogenesis of these disorders. We have recently reported a recessive mutation in TECPR2, an autophagy-implicated WD repeat-containing protein, in five individuals with a novel form of monogenic hereditary spastic paraparesis (HSP). We found that diseased skin fibroblasts had a decreased accumulation of the autophagy-initiation protein MAP1LC3B/LC3B, and an attenuated delivery of both LC3B and the cargo-recruiting protein SQSTM1/p62 to the lysosome where they are subject to degradation. The discovered TECPR2 mutation reveals for the first time a role for aberrant autophagy in a major class of Mendelian neurodegenerative diseases, and suggests mechanisms by which impaired autophagy may impinge on a broader scope of neurodegeneration.
hereditary spastic paraparesis; autophagy; endosomal trafficking; exome sequencing; MAP1LC3; skin fibroblasts
Genetic variations in olfactory receptors likely contribute to the diversity of odorant-specific sensitivity phenotypes. Our working hypothesis is that genetic variations in auxiliary olfactory genes, including those mediating transduction and sensory neuronal development, may constitute the genetic basis for general olfactory sensitivity (GOS) and congenital general anosmia (CGA). We thus performed a systematic exploration for auxiliary olfactory genes and their documented variation. This included a literature survey, seeking relevant functional in vitro studies, mouse gene knockouts and human disorders with olfactory phenotypes, as well as data mining in published transcriptome and proteome data for genes expressed in olfactory tissues. In addition, we performed next-generation transcriptome sequencing (RNA-seq) of human olfactory epithelium and mouse olfactory epithelium and bulb, so as to identify sensory-enriched transcripts. Employing a global score system based on attributes of the 11 data sources utilized, we identified a list of 1,680 candidate auxiliary olfactory genes, of which 450 are shortlisted as having higher probability of a functional role. For the top-scoring 136 genes, we identified genomic variants (probably damaging single nucleotide polymorphisms, indels, and copy number deletions) gleaned from public variation repositories. This database of genes and their variants should assist in rationalizing the great interindividual variation in human overall olfactory sensitivity (http://genome.weizmann.ac.il/GOSdb).
olfactory candidate genes; congenital general anosmia; RNA-seq
We propose an automaton, a theoretical framework that demonstrates how to improve the yield of the synthesis of branched chemical polymer reactions. This is achieved by separating substeps of the path of synthesis into compartments. We use chemical containers (chemtainers) to carry the substances through a sequence of fixed successive compartments. We describe the automaton in mathematical terms and show how it can be configured automatically in order to synthesize a given branched polymer target. The algorithm we present finds an optimal path of synthesis in linear time. We discuss how the automaton models compartmentalized structures found in cells, such as the endoplasmic reticulum and the Golgi apparatus, and we show how this compartmentalization can be exploited for the synthesis of branched polymers such as oligosaccharides. Lastly, we show examples of artificial branched polymers and discuss how the automaton can be configured to synthesize them with maximal yield.
Comprehensive disease classification, integration and annotation are crucial for biomedical discovery. At present, disease compilation is incomplete, heterogeneous and often lacking systematic inquiry mechanisms. We introduce MalaCards, an integrated database of human maladies and their annotations, modeled on the architecture and strategy of the GeneCards database of human genes. MalaCards mines and merges 44 data sources to generate a computerized card for each of 16 919 human diseases. Each MalaCard contains disease-specific prioritized annotations, as well as inter-disease connections, empowered by the GeneCards relational database, its searches and GeneDecks set analyses. First, we generate a disease list from 15 ranked sources, using disease-name unification heuristics. Next, we use four schemes to populate MalaCards sections: (i) directly interrogating disease resources, to establish integrated disease names, synonyms, summaries, drugs/therapeutics, clinical features, genetic tests and anatomical context; (ii) searching GeneCards for related publications, and for associated genes with corresponding relevance scores; (iii) analyzing disease-associated gene sets in GeneDecks to yield affiliated pathways, phenotypes, compounds and GO terms, sorted by a composite relevance score and presented with GeneCards links; and (iv) searching within MalaCards itself, e.g. for additional related diseases and anatomical context. The latter forms the basis for the construction of a disease network, based on shared MalaCards annotations, embodying associations based on etiology, clinical features and clinical conditions. This broadly disposed network has a power-law degree distribution, suggesting that this might be an inherent property of such networks. Work in progress includes hierarchical malady classification, ontological mapping and disease set analyses, striving to make MalaCards an even more effective tool for biomedical research.
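The network construction described above can be illustrated with a minimal sketch. It assumes, hypothetically, that each disease is represented by a set of annotation terms and that two diseases are linked when they share at least one term. The toy diseases, the terms, the `min_shared` threshold and the function names are illustrative only and are not taken from MalaCards.

```python
from collections import Counter
from itertools import combinations


def build_disease_network(annotations, min_shared=1):
    """Link two diseases when they share at least `min_shared` annotations."""
    edges = set()
    for a, b in combinations(sorted(annotations), 2):
        if len(annotations[a] & annotations[b]) >= min_shared:
            edges.add((a, b))
    return edges


def degree_distribution(edges):
    """Map each degree to the number of connected nodes having that degree.

    Isolated nodes (degree 0) never appear in an edge and are not counted.
    """
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return Counter(degree.values())


# Toy, made-up annotation sets (hypothetical, for illustration only).
toy = {
    "disease_A": {"gene1", "pathway1"},
    "disease_B": {"gene1"},
    "disease_C": {"pathway1", "gene2"},
    "disease_D": {"gene3"},
}
edges = build_disease_network(toy)
dist = degree_distribution(edges)
```

On a real network, a power-law degree distribution would show up as an approximately straight line when the counts in `dist` are plotted against degree on log-log axes; the four-node toy set is far too small to show this.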
Paenibacillus dendritiformis is a Gram-positive, soil-dwelling, spore-forming social microorganism. An intriguing collective faculty of this strain is manifested by its ability to switch between different morphotypes, such as the branching (T) and the chiral (C) morphotypes. Here we report the 6.3-Mb draft genome sequence of the P. dendritiformis C454 chiral morphotype.
Information on nucleotide diversity along completely sequenced human genomes has increased tremendously over the last few years. This makes it possible to reassess the diversity status of distinct receptor proteins in different human individuals. To this end, we focused on the complete inventory of human olfactory receptor coding regions as a model for personal receptor repertoires.
By mining data from public and private sources, we scored genetic variations in 413 intact OR loci, for which one or more individuals had an intact open reading frame. Using 1000 Genomes Project haplotypes, we identified a total of 4069 full-length polypeptide variants encoded by these OR loci, an average of ~10 per locus, constituting a lower limit for the effective human OR repertoire. Each individual is found to harbor as many as 600 OR allelic variants, ~50% higher than the locus count. Because OR neuronal expression is allelically excluded, this has a direct effect on the smell perception diversity of the species. We further identified 244 OR segregating pseudogenes (SPGs), loci showing both intact and pseudogene forms in the population, twenty-six of which are annotatively “resurrected” from a pseudogene status in the reference genome. Using a custom SNP microarray we validated 150 SPGs in a cohort of 468 individuals, with every individual genome averaging 36 disrupted sequence variations, 15 in homozygote form. Finally, we generated a multi-source compendium of 63 OR loci harboring deletion Copy Number Variations (CNVs). Our combined data suggest that 271 of the 413 intact OR loci (66%) are affected by nonfunctional SNPs/indels and/or CNVs.
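The summary ratios quoted above can be recomputed from the reported counts. The short sketch below uses only numbers stated in the text; the variable names are illustrative.

```python
# Counts as reported in the text above.
INTACT_LOCI = 413
POLYPEPTIDE_VARIANTS = 4069
AFFECTED_LOCI = 271
MAX_VARIANTS_PER_INDIVIDUAL = 600

# Average number of full-length polypeptide variants per intact locus.
avg_per_locus = POLYPEPTIDE_VARIANTS / INTACT_LOCI

# Fraction of intact loci affected by nonfunctional SNPs/indels and/or CNVs.
affected_fraction = AFFECTED_LOCI / INTACT_LOCI

# Upper-bound allelic variants per individual relative to the locus count.
per_individual_ratio = MAX_VARIANTS_PER_INDIVIDUAL / INTACT_LOCI

print(f"{avg_per_locus:.2f} variants per locus")
print(f"{affected_fraction:.1%} of intact loci affected")
print(f"{per_individual_ratio:.2f}x the locus count")
```

The first two values agree with the rounded figures in the text (~10 per locus and 66%). The third comes to about 1.45, so the stated ~50% excess is a rounded upper estimate.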
These results portray a case of unusually high genetic diversity, and suggest that individual humans have a highly personalized inventory of functional olfactory receptors, a conclusion that might apply to other receptor multigene families.
Olfactory receptor; Genetic polymorphism; Haplotypes; Single nucleotide polymorphism; Copy number variation; Olfaction; Gene family
Many reports in different populations have demonstrated linkage of the 10q24–q26 region to schizophrenia, thus encouraging further analysis of this locus for detection of specific schizophrenia genes. Our group previously reported linkage of the 10q24–q26 region to schizophrenia in a unique, homogeneous sample of Arab-Israeli families with multiple schizophrenia-affected individuals, under a dominant model of inheritance. To further explore this candidate region and identify specific susceptibility variants within it, we re-analyzed the 10q24–q26 genotype data taken from our previous genome-wide association study (GWAS) (Alkelai et al., 2011). We analyzed 2089 SNPs in an extended sample of 57 Arab-Israeli families (189 genotyped individuals), under the dominant model of inheritance, which best fits this locus according to previously performed MOD score analysis. We found significant association with schizophrenia of the TCF7L2 intronic SNP rs12573128 (p = 7.01×10^−6) and of the nearby intergenic SNP rs1033772 (p = 6.59×10^−6), which is positioned between TCF7L2 and HABP2. TCF7L2 is one of the best confirmed susceptibility genes for type 2 diabetes (T2D) among different ethnic groups, has a role in pancreatic beta cell function and may contribute to the comorbidity of schizophrenia and T2D. These preliminary results independently support previous findings regarding a possible role of TCF7L2 in susceptibility to schizophrenia, and strengthen the importance of integrating linkage analysis models of inheritance while performing association analyses in regions of interest. Further validation studies in additional populations are required.
Large numbers of mass spectrometry proteomics studies are being conducted to understand all types of biological processes. The size and complexity of proteomics data hinder efforts to easily share, integrate, query and compare the studies. The Model Organism Protein Expression Database (MOPED, http://moped.proteinspire.org) is a new and expanding proteomics resource that enables rapid browsing of protein expression information from publicly available studies on humans and model organisms. MOPED is designed to simplify the comparison and sharing of proteomics data for the greater research community. MOPED uniquely provides protein level expression data, meta-analysis capabilities and quantitative data from standardized analysis. Data can be queried for specific proteins, browsed based on organism, tissue, localization and condition and sorted by false discovery rate and expression. MOPED empowers users to visualize their own expression data and compare it with existing studies. Further, MOPED links to various protein and pathway databases, including GeneCards, Entrez, UniProt, KEGG and Reactome. The current version of MOPED contains over 43 000 proteins with at least one spectral match and more than 11 million high certainty spectra.
The zebra finch is an important model organism in several fields [1,2] with unique relevance to human neuroscience [3,4]. Like other songbirds, the zebra finch communicates through learned vocalizations, an ability otherwise documented only in humans and a few other animals and lacking in the chicken [5], the only bird with a sequenced genome until now [6]. Here we present a structural, functional and comparative analysis of the genome sequence of the zebra finch (Taeniopygia guttata), which is a songbird belonging to the large avian order Passeriformes [7]. We find that the overall structures of the genomes are similar in zebra finch and chicken, but they differ in many intrachromosomal rearrangements, lineage-specific gene family expansions, the number of long-terminal-repeat-based retrotransposons, and mechanisms of sex chromosome dosage compensation. We show that song behaviour engages gene regulatory networks in the zebra finch brain, altering the expression of long non-coding RNAs, microRNAs, transcription factors and their targets. We also show evidence for rapid molecular evolution in the songbird lineage of genes that are regulated during song experience. These results indicate an active involvement of the genome in neural processes underlying vocal communication and identify potential genetic substrates for the evolution and regulation of this behaviour.
Since 1998, the bioinformatics, systems biology, genomics and medical communities have enjoyed a synergistic relationship with the GeneCards database of human genes (http://www.genecards.org). This human gene compendium was created to help introduce order into the increasing chaos of information flow. As a consequence of viewing details and deep links related to specific genes, users have often requested enhanced capabilities, such that, over time, GeneCards has blossomed into a suite of tools (including GeneDecks, GeneALaCart, GeneLoc, GeneNote and GeneAnnot) for a variety of analyses of both single human genes and sets thereof. In this paper, we focus on in-house and external research activities which have been enabled, enhanced, complemented and, in some cases, motivated by GeneCards. In turn, such interactions have often inspired and propelled improvements in GeneCards. We describe here the evolution and architecture of this project, including examples of synergistic applications in diverse areas such as synthetic lethality in cancer, the annotation of genetic variations in disease, omics integration in a systems biology approach to kidney disease, and bioinformatics tools.
GeneCards; GeneDecks; Partner Hunter; Set Distiller; omics; genomics; human genes; database; synthetic lethality; genetic variations
We propose an innovative, integrated, cost-effective health system to combat major non-communicable diseases (NCDs), including cardiovascular, chronic respiratory, metabolic, rheumatologic and neurologic disorders and cancers, which together are the predominant health problem of the 21st century. This proposed holistic strategy involves comprehensive patient-centered integrated care and multi-scale, multi-modal and multi-level systems approaches to tackle NCDs as a common group of diseases. Rather than studying each disease individually, it will take into account their intertwined gene-environment, socio-economic interactions and co-morbidities that lead to individual-specific complex phenotypes. It will implement a road map for predictive, preventive, personalized and participatory (P4) medicine based on a robust and extensive knowledge management infrastructure that contains individual patient information. It will be supported by strategic partnerships involving all stakeholders, including general practitioners associated with patient-centered care. This systems medicine strategy, which will take a holistic approach to disease, is designed to allow the results to be used globally, taking into account the needs and specificities of local economies and health systems.
The pattern-forming bacterium Paenibacillus vortex is notable for its advanced social behavior, which is reflected in development of colonies with highly intricate architectures. Prior to this study, only two other Paenibacillus species (Paenibacillus sp. JDR-2 and Paenibacillus larvae) had been sequenced. However, no genomic data were available on Paenibacillus species with pattern-forming behavior and complex social motility. Here we report the de novo genome sequence of this Gram-positive, soil-dwelling, sporulating bacterium.
The complete P. vortex genome was sequenced by a hybrid approach using 454 Life Sciences and Illumina, achieving a total of 289× coverage, with 99.8% sequence identity between the two methods. The sequencing results were validated using a custom-designed Agilent microarray expression chip which represented the coding and the non-coding regions. Analysis of the P. vortex genome revealed 6,437 open reading frames (ORFs) and 73 non-coding RNA genes. Comparative genomic analysis with 500 complete bacterial genomes revealed an exceptionally high number of two-component system (TCS) genes, transcription factors (TFs), and transport- and defense-related genes. Additionally, we have identified genes involved in the production of antimicrobial compounds and extracellular degrading enzymes.
These findings suggest that P. vortex has advanced faculties to perceive and react to a wide range of signaling molecules and environmental conditions, which could be associated with its ability to reconfigure and replicate complex colony architectures. Additionally, P. vortex is likely to serve as a rich source of genes important for agricultural, medical and industrial applications and it has the potential to advance the study of social microbiology within Gram-positive bacteria.
Copy-number variations (CNVs) are widespread in the human genome, but comprehensive assignments of integer locus copy-numbers (i.e., copy-number genotypes) that, for example, enable discrimination of homozygous from heterozygous CNVs, have remained challenging. Here we present CopySeq, a novel computational approach with an underlying statistical framework that analyzes the depth-of-coverage of high-throughput DNA sequencing reads, and can incorporate paired-end and breakpoint junction analysis based CNV-analysis approaches, to infer locus copy-number genotypes. We benchmarked CopySeq by genotyping 500 chromosome 1 CNV regions in 150 personal genomes sequenced at low coverage. The assessed copy-number genotypes were highly concordant with the qPCR experiments we performed (Pearson correlation coefficient 0.94), and with the published results of two microarray platforms (95–99% concordance). We further demonstrated the utility of CopySeq for analyzing gene regions enriched for segmental duplications by comprehensively inferring copy-number genotypes in the CNV-enriched >800 olfactory receptor (OR) human gene and pseudogene loci. CopySeq revealed that OR loci display an extensive range of locus copy-numbers across individuals, with zero to two copies in some OR loci, and two to nine copies in others. Among genetic variants affecting OR loci we identified deleterious variants including CNVs and SNPs affecting ∼15% and ∼20% of the human OR gene repertoire, respectively, implying that genetic variants with a possible impact on smell perception are widespread. Finally, we found that for several OR loci the reference genome appears to represent a minor-frequency variant, implying a necessary revision of the OR repertoire for future functional studies.
CopySeq can ascertain genomic structural variation in specific gene families as well as at a genome-wide scale, where it may enable the quantitative evaluation of CNVs in genome-wide association studies involving high-throughput sequencing.
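The idea of inferring an integer copy-number genotype from depth-of-coverage can be illustrated with a deliberately naive sketch. This is not CopySeq's statistical framework, which the text does not specify. It only assumes that read depth scales linearly with copy number and that a reference depth for diploid (two-copy) regions is known. The function name and the example depths are hypothetical.

```python
def estimate_copy_number(region_depth, diploid_depth, ploidy=2):
    """Naive integer copy-number estimate from read depth.

    Assumes depth is proportional to copy number, so a region covered at
    the diploid reference depth has `ploidy` copies. Real data would need
    a statistical model for noise, GC bias and mappability.
    """
    if diploid_depth <= 0:
        raise ValueError("diploid_depth must be positive")
    if region_depth < 0:
        raise ValueError("region_depth must be non-negative")
    return round(ploidy * region_depth / diploid_depth)


# Hypothetical mean depths for four loci in a genome with diploid depth 30x.
homozygous_deletion = estimate_copy_number(0.0, 30.0)
heterozygous_deletion = estimate_copy_number(15.0, 30.0)
normal = estimate_copy_number(30.0, 30.0)
duplication = estimate_copy_number(61.0, 30.0)
```

Under these assumptions a heterozygous deletion (one copy) and a homozygous deletion (zero copies) fall into distinct integer classes, which is the distinction that copy-number genotyping aims to make.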
Human individual genome sequencing has recently become affordable, enabling highly detailed genetic sequence comparisons. While the identification and genotyping of single-nucleotide polymorphisms has already been successfully established for different sequencing platforms, the detection, quantification and genotyping of large-scale copy-number variants (CNVs), i.e., losses or gains of long genomic segments, has remained challenging. We present a computational approach that enables detecting CNVs in sequencing data and accurately identifies the actual copy-number at which DNA segments of interest occur in an individual genome. This approach enabled us to obtain novel insights into the largest human gene family – the olfactory receptors (ORs) – involved in smell perception. While previous studies reported an abundance of CNVs in ORs, our approach enabled us to globally identify absolute differences in OR gene counts that exist between humans. While several OR genes have very high gene counts, other ORs are found only once or are missing entirely in some individuals. The latter have a particularly high probability of influencing individual differences in the perception of smell, a question that future experimental efforts can now address. Furthermore, we observed differences in OR gene counts between populations, pointing at ORs that might contribute to population-specific differences in smell.
GeneCards (www.genecards.org) is a comprehensive, authoritative compendium of annotative information about human genes, widely used for nearly 15 years. Its gene-centric content is automatically mined and integrated from over 80 digital sources, resulting in a web-based deep-linked card for each of >73 000 human gene entries, encompassing the following categories: protein coding, pseudogene, RNA gene, genetic locus, cluster and uncategorized. We now introduce GeneCards Version 3, featuring a speedy and sophisticated search engine and a revamped, technologically enabling infrastructure, catering to the expanding needs of biomedical researchers. A key focus is on gene-set analyses, which leverage GeneCards’ unique wealth of combinatorial annotations. These include the GeneALaCart batch query facility, which tabulates user-selected annotations for multiple genes and GeneDecks, which identifies similar genes with shared annotations, and finds set-shared annotations by descriptor enrichment analysis. Such set-centric features address a host of applications, including microarray data analysis, cross-database annotation mapping and gene-disorder associations for drug targeting. We highlight the new Version 3 database architecture, its multi-faceted search engine, and its semi-automated quality assurance system. Data enhancements include an expanded visualization of gene expression patterns in normal and cancer tissues, an integrated alternative splicing pattern display, and augmented multi-source SNPs and pathways sections. GeneCards now provides direct links to gene-related research reagents such as antibodies, recombinant proteins, DNA clones and inhibitory RNAs and features gene-related drugs and compounds lists. We also portray the GeneCards Inferred Functionality Score annotation landscape tool for scoring a gene’s functional information status. Finally, we delineate examples of applications and collaborations that have benefited from the GeneCards suite.
Database URL: www.genecards.org
An important facet of early biological evolution is the selection of chiral enantiomers for molecules such as amino acids and sugars. The origin of this symmetry breaking is a long-standing question in molecular evolution. Previous models addressing this question include particular kinetic properties such as autocatalysis or negative cross catalysis.
We propose here a more general kinetic formalism for early enantioselection, based on our previously described Graded Autocatalysis Replication Domain (GARD) model for prebiotic evolution in molecular assemblies. This model is adapted here to the case of chiral molecules by applying symmetry constraints to mutual molecular recognition within the assembly. The ensuing dynamics shows spontaneous chiral symmetry breaking, with transitions towards stationary compositional states (composomes) enriched with one of the two enantiomers for some of the constituent molecule types. Furthermore, one or the other of the two antipodal compositional states of the assembly also shows time-dependent selection.
It follows that chiral selection may be an emergent consequence of early catalytic molecular networks rather than a prerequisite for the initiation of primeval life processes. Elaborations of this model could help explain the prevalent chiral homogeneity in present-day living cells.
This article was reviewed by Boris Rubinstein (nominated by Arcady Mushegian), Arcady Mushegian, Meir Lahav (nominated by Yitzhak Pilpel) and Sergei Maslov.
We present a draft genome sequence of the platypus, Ornithorhynchus anatinus. This monotreme exhibits a fascinating combination of reptilian and mammalian characters. For example, platypuses have a coat of fur adapted to an aquatic lifestyle; platypus females lactate, yet lay eggs; and males are equipped with venom similar to that of reptiles. Analysis of the first monotreme genome aligned these features with genetic innovations. We find that reptile and platypus venom proteins have been co-opted independently from the same gene families; milk protein genes are conserved despite platypuses laying eggs; and immune gene family expansions are directly related to platypus biology. Expansions of protein, non-protein-coding RNA and microRNA families, as well as repeat elements, are identified. Sequencing of this genome now provides a valuable resource for deep mammalian comparative analyses, as well as for monotreme biology and conservation.
Gene annotation is a pivotal component in computational genomics, encompassing prediction of gene function, expression analysis, and sequence scrutiny. Hence, quantitative measures of the annotation landscape constitute a pertinent bioinformatics tool. GeneCards® is a gene-centric compendium of rich annotative information for over 50,000 human gene entries, building upon 68 data sources, including Gene Ontology (GO), pathways, interactions, phenotypes, publications and many more.
We present the GeneCards Inferred Functionality Score (GIFtS) which allows a quantitative assessment of a gene's annotation status, by exploiting the unique wealth and diversity of GeneCards information. The GIFtS tool, linked from the GeneCards home page, facilitates browsing the human genome by searching for the annotation level of a specified gene, retrieving a list of genes within a specified range of GIFtS value, obtaining random genes with a specific GIFtS value, and experimenting with the GIFtS weighting algorithm for a variety of annotation categories. The bimodal shape of the GIFtS distribution suggests a division of the human gene repertoire into two main groups: the high-GIFtS peak consists almost entirely of protein-coding genes; the low-GIFtS peak consists of genes from all of the categories. Cluster analysis of GIFtS annotation vectors provides the classification of gene groups by detailed positioning in the annotation arena. GIFtS also provides measures which enable the evaluation of the databases that serve as GeneCards sources. An inverse correlation is found (for GIFtS>25) between the number of genes annotated by each source, and the average GIFtS value of genes associated with that source. Three typical source prototypes are revealed by their GIFtS distribution: genome-wide sources, sources comprising mainly highly annotated genes, and sources comprising mainly poorly annotated genes. The degree of accumulated knowledge for a given gene measured by GIFtS was correlated (for GIFtS>30) with the number of publications for a gene, and with the seniority of this entry in the HGNC database.
GIFtS can be a valuable tool for computational procedures which analyze lists of large sets of genes resulting from wet-lab or computational research. GIFtS may also assist the scientific community with identification of groups of uncharacterized genes for diverse applications, such as delineation of novel functions and charting unexplored areas of the human genome.
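A score of this kind can be sketched as a weighted presence/absence tally over annotation sources. The text does not give the actual GIFtS formula, so the sketch below is an assumption: it scores a gene as the weighted percentage of sources that annotate it, on a 0 to 100 scale. The function name, the source names and the weights are hypothetical.

```python
def annotation_score(present_sources, all_sources, weights=None):
    """Weighted percentage (0-100) of sources that annotate a gene.

    `weights` maps a source name to its weight; unlisted sources weigh 1.0.
    This is an illustrative formula, not the published GIFtS algorithm.
    """
    weights = weights or {}
    total = sum(weights.get(s, 1.0) for s in all_sources)
    if total <= 0:
        raise ValueError("total source weight must be positive")
    hit = sum(weights.get(s, 1.0) for s in all_sources if s in present_sources)
    return 100.0 * hit / total


# Hypothetical sources and a gene annotated by two of the four.
sources = ["GO", "pathways", "publications", "phenotypes"]
gene_sources = {"GO", "publications"}

unweighted = annotation_score(gene_sources, sources)
weighted = annotation_score(gene_sources, sources, {"publications": 2.0})
```

Changing the weights, as in the second call, mirrors the kind of experimentation with category weighting that the text describes for the GIFtS tool.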
The olfactory receptor gene (OR) superfamily is the largest in the human genome. The superfamily contains 390 putatively functional genes and 465 pseudogenes arranged into 18 gene families and 300 subfamilies. Even members within the same subfamily are often located on different chromosomes. OR genes are located on all autosomes except chromosome 20, plus the X chromosome but not the Y chromosome. The gene:pseudogene ratio is lowest in human, higher in chimpanzee and highest in rat and mouse — most likely reflecting the greater need of olfaction for survival in the rodent than in the human. The OR genes undergo allelic exclusion, each sensory neurone expressing usually only one odourant receptor allele; the mechanism by which this phenomenon is regulated is not yet understood. The nomenclature system (based on evolutionary divergence of genes into families and subfamilies of the OR gene superfamily) has been designed similarly to that originally used for the CYP gene superfamily.
classification of gene families and subfamilies; OR gene superfamily; CYP gene superfamily; nasal olfactory neurone; olfaction; olfactory receptor gene superfamily; allelic exclusion; opossum genome; platypus genome
Olfactory Receptors (ORs) form the largest multigene family in vertebrates. Their evolution and expansion in vertebrate genomes have been the subject of many studies. In this paper we apply a motif-based approach to this problem in order to uncover evolutionary characteristics.
We extract deterministic motifs from ORs belonging to ten species using the MEX (Motif Extraction) algorithm, thus defining Common Peptides (CPs) characteristic of ORs. We identify species-specific CPs and show that their relative abundance is high only in fish and frog, suggesting relevance to water-soluble odorants. We estimate the origins of CPs according to the tree of life and track the gains and losses of CPs through evolution. We identify major CP gain in tetrapods and major losses in reptiles. Although the number of human ORs is less than half of the number of ORs in other mammals, the fraction of lost CPs is only 11%.
By examining the positions of CPs along the OR sequence, we find two regions that expanded only in tetrapods. Using CPs we are able to establish remote homology relations between ORs and non-OR GPCRs.
Selecting CPs according to their evolutionary age, we bicluster ORs and CPs for each species. Clean biclustering emerges when using relatively novel CPs. Evolutionary age is used to track the history of CP acquisition in the collection of mammalian OR families within HORDE (Human Olfactory Receptor Data Explorer).
The CP method provides a novel perspective that reveals interesting traits in the evolution of olfactory receptors. It is consistent with previous knowledge, and provides finer details. Using available phylogenetic trees, evolution can be rephrased in terms of CP origins.
Supplementary information is also available at
Olfactory receptors (ORs), which are involved in odorant recognition, form the largest mammalian protein superfamily. The genomic content of OR genes is considerably reduced in humans, as reflected by the relatively small repertoire size and the high fraction (∼55%) of human pseudogenes. Since several recent low-resolution surveys suggested that OR genomic loci are frequently affected by copy-number variants (CNVs), we hypothesized that CNVs may play an important role in the evolution of the human olfactory repertoire. We used high-resolution oligonucleotide tiling microarrays to detect CNVs across 851 OR gene and pseudogene loci. Examining genomic DNA from 25 individuals with ancestry from three populations, we identified 93 OR gene loci and 151 pseudogene loci affected by CNVs, generating a mosaic of OR dosages across persons. Our data suggest that ∼50% of the CNVs involve more than one OR, with the largest CNV spanning 11 loci. In contrast to earlier reports, we observe that CNVs are more frequent among OR pseudogenes than among intact genes, presumably due to both selective constraints and CNV formation biases. Furthermore, our results show an enrichment of CNVs among ORs with a close human paralog or lacking a one-to-one ortholog in chimpanzee. Interestingly, among the latter we observed an enrichment in CNV losses over gains, a finding potentially related to the known diminution of the human OR repertoire. Quantitative PCR experiments performed for 122 sampled ORs agreed well with the microarray results and uncovered 23 additional CNVs. Importantly, these experiments allowed us to uncover nine common deletion alleles that affect 15 OR genes and five pseudogenes. Comparison to the chimpanzee reference genome revealed that all of the deletion alleles are human derived, therefore indicating a profound effect of human-specific deletions on the individual OR gene content. 
Furthermore, these deletion alleles may be used in future genetic association studies of olfactory inter-individual differences.
Copy-number variants (CNVs) are deletions and duplications of DNA segments, responsible for most of the genome variation in mammals. To help elucidate the impact of CNVs on evolution and function, we provide a high-resolution CNV map of the largest gene superfamily in humans, i.e., the olfactory receptor (OR) gene superfamily. Our map reveals twice as many olfactory CNVs per person as previously reported, indicating considerable OR dosage variations in humans. In particular, our findings indicate that CNVs are specifically enriched among evolutionarily “young” ORs, some of which originated following the human-chimpanzee split, implying that CNVs may play an important role in the gene-birth and gene-loss processes that continuously shape the human OR repertoire. Furthermore, we describe 15 OR gene loci showing frequent human-specific deletion alleles. Additionally, we present evidence for a recent non-allelic homologous recombination event involving a pair of OR genes, forming a novel fusion OR that may harbor novel odorant-binding properties. Such events may potentially relate to individual functional “holes” in the human smell-detection repertoire, and future studies will address the specific chemosensory impact of our genomic variation map.
The coevolution of environment and living organisms is well known in nature. Here, it is suggested that similar processes can take place before the onset of life, where protocellular entities, rather than full-fledged living systems, coevolve along with their surroundings. Specifically, it is suggested that the chemical composition of the environment may have governed the chemical repertoire generated within molecular assemblies, compositional protocells, while compounds generated within these protocells altered the chemical composition of the environment. We present an extension of the graded autocatalysis replication domain (GARD) model: the environment exchange polymer GARD (EE-GARD) model. In the new model, molecules that are formed in a protocellular assembly may be exported to the environment that surrounds the protocell. Computer simulations of the model using an infinite-sized environment showed that EE-GARD assemblies may assume several distinct quasi-stationary compositions (composomes), similar to the observations in previous variants of the GARD model. A statistical analysis suggested that the repertoire of composomes manifested by the assemblies is independent of time. In simulations with a finite environment, this was not the case: composomes that were frequent in the early stages of the simulation disappeared, while others emerged. The change in the frequencies of composomes was found to be correlated with changes induced on the environment by the assembly. The EE-GARD model is the first GARD model to portray a possible time evolution of the composome repertoire.
compositional protocell; coevolution; environment; the GARD model; composomes
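The effect of a finite environment can be illustrated with a toy simulation. This is not the EE-GARD model itself: no polymers are formed, and the catalytic matrix `beta`, the growth rule, the split rule, and all parameter values are assumptions made for illustration. The sketch only shows how an assembly that takes molecules from, and returns molecules to, a finite pool changes the composition of that pool.

```python
import random
from typing import List, Tuple


def weighted_pick(rng: random.Random, weights: List[float]) -> int:
    """Pick an index with probability proportional to its positive weight."""
    idx = [i for i, w in enumerate(weights) if w > 0]
    return rng.choices(idx, weights=[weights[i] for i in idx])[0]


def simulate(env: List[int], steps: int, max_size: int = 20,
             seed: int = 0) -> Tuple[List[int], List[int]]:
    """Grow and split one toy assembly inside a finite environment.

    Returns (assembly_counts, environment_counts) after at most `steps`
    join events. Molecules only move between the two pools.
    """
    rng = random.Random(seed)
    n = len(env)
    # Assumed catalytic enhancement: beta[i][j] is how strongly a member of
    # type j speeds up the joining of type i (illustrative lognormal values).
    beta = [[rng.lognormvariate(0.0, 1.0) for _ in range(n)] for _ in range(n)]
    env = list(env)
    assembly = [0] * n
    for _ in range(steps):
        size = sum(assembly)
        # Join propensity: availability in the environment times
        # (1 + mean catalysis by the current members of the assembly).
        join = [
            env[i] * (1.0 + sum(beta[i][j] * assembly[j] for j in range(n)) / max(size, 1))
            for i in range(n)
        ]
        if sum(join) == 0:
            break  # the environment is exhausted
        i = weighted_pick(rng, join)
        env[i] -= 1
        assembly[i] += 1
        if sum(assembly) >= max_size:
            # Split: half of the members, chosen at random, go back to the
            # environment, which changes what is available for later growth.
            for _ in range(max_size // 2):
                k = weighted_pick(rng, [float(c) for c in assembly])
                assembly[k] -= 1
                env[k] += 1
    return assembly, env


assembly, environment = simulate([50, 50, 50], steps=500)
```

Because the join propensity depends on `env`, the composition the assembly can reach shifts as the pool is depleted; with a very large pool the counts in `env` barely change relative to their size, which is the toy analogue of the infinite-environment case.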