Locus Reference Genomic (LRG; http://www.lrg-sequence.org/) records contain internationally recognized stable reference sequences designed specifically for reporting clinically relevant sequence variants. Each LRG is contained within a single file consisting of a stable ‘fixed’ section and a regularly updated ‘updatable’ section. The fixed section contains stable genomic DNA sequence for a genomic region, essential transcripts and proteins for variant reporting and an exon numbering system. The updatable section contains mapping information, annotation of all transcripts and overlapping genes in the region and legacy exon and amino acid numbering systems. LRGs provide a stable framework that is vital for reporting variants, according to Human Genome Variation Society (HGVS) conventions, in genomic DNA, transcript or protein coordinates. To enable translation of information between LRG and genomic coordinates, LRGs include mapping to the human genome assembly. LRGs are compiled and maintained by the National Center for Biotechnology Information (NCBI) and European Bioinformatics Institute (EBI). LRG reference sequences are selected in collaboration with the diagnostic and research communities, locus-specific database curators and mutation consortia. Currently >700 LRGs have been created, of which >400 are publicly available. The aim is to create an LRG for every locus with clinical implications.
The Kelch-like (KLHL) gene family encodes a group of proteins that generally possess a BTB/POZ domain, a BACK domain, and five to six Kelch motifs. BTB domains facilitate protein binding and dimerization. The BACK domain has no known function yet is of functional importance since mutations in this domain are associated with disease. Kelch domains form a tertiary structure of β-propellers that have a role in extracellular functions, morphology, and binding to other proteins. Presently, 42 KLHL genes have been classified by the HUGO Gene Nomenclature Committee (HGNC), and they are found across multiple human chromosomes. The KLHL family is conserved throughout evolution. Phylogenetic analysis of KLHL family members suggests that it can be subdivided into three subgroups with KLHL11 as the oldest member and KLHL9 as the youngest. Several KLHL proteins bind to the E3 ligase cullin 3 and are known to be involved in ubiquitination. KLHL genes are responsible for several Mendelian diseases and have been associated with cancer. Further investigation of this family of proteins will likely provide valuable insights into basic biology and human disease.
KLHL; Kelch domain; BTB domain; Ubiquitination; Gene family; Evolution; Mendelian disease; Gene nomenclature; Cancer
The HUGO Gene Nomenclature Committee has approved gene symbols for the majority of protein-coding genes on the human reference genome. To adequately represent regions of complex structural variation, the Genome Reference Consortium now includes alternative representations of some of these regions as part of the reference genome. Here, we describe examples of how we name novel genes in these regions and how this nomenclature is displayed on our website, http://genenames.org.
Gene nomenclature; Reference genome; Structural variants; Human
The field of transport biology has steadily grown over the past decade and is now recognized as playing an important role in manifestation and treatment of disease. The SLC (solute carrier) gene series has grown to now include 52 families and 395 transporter genes in the human genome. A list of these genes can be found at the HUGO Gene Nomenclature Committee (HGNC) website (see www.genenames.org/genefamilies/SLC). This special issue features mini-reviews for each of these SLC families written by the experts in each field. The existing online resource for solute carriers, the Bioparadigms SLC Tables (www.bioparadigms.org), has been updated and significantly extended with additional information and cross-links to other relevant databases, and the nomenclature used in this database has been validated and approved by the HGNC. In addition, the Bioparadigms SLC Tables functionality has been improved to allow easier access by the scientific community. This introduction includes: an overview of all known SLC and “non-SLC” transporter genes; a list of transporters of water soluble vitamins; a summary of recent progress in the structure determination of transporters (including GLUT1/SLC2A1); roles of transporters in human diseases and roles in drug approval and pharmaceutical perspectives.
Transporter; Carrier; Nomenclature; Solute carrier genes; SLC; Exchanger; Cotransporter; Uniporter; Symporter; Antiporter; Ion transport; Solute transport; Coupled transport; Channel; Pump; ABC transporter; Aquaporin; Water soluble vitamins; Structure; Membrane proteins; Glucose transporter; Diabetes; GLUT1; SLC2A1
The HUGO Gene Nomenclature Committee situated at the European Bioinformatics Institute assigns unique symbols and names to human genes. Since 2011, the data within our database has expanded largely owing to an increase in naming pseudogenes and non-coding RNA genes, and we now have >33 500 approved symbols. Our gene families and groups have also increased to nearly 500, with ∼45% of our gene entries associated to at least one family or group. We have also redesigned the HUGO Gene Nomenclature Committee website http://www.genenames.org creating a constant look and feel across the site and improving usability and readability for our users. The site provides a public access portal to our database with no restrictions imposed on access or the use of the data. Within this article, we review our online resources and data with particular emphasis on the updates to our website.
“Go to, let us go down, and there confound their language, that they may not understand one another's speech. …Therefore is the name of it called Babel; because the Lord did there confound the language of all the earth…”
Arabidopsis; autophagy; Caenorhabditis; genes; human; lysosome; mammalian; mouse; nomenclature; rat; stress; vacuole; Xenopus; yeast; zebrafish
The HUGO Gene Nomenclature Committee (HGNC) assigns approved gene symbols to human loci. There are currently over 33,000 approved gene symbols, the majority of which represent protein-coding genes, but we also name other locus types such as non-coding RNAs, pseudogenes and phenotypic loci. Where relevant, the HGNC organise these genes into gene families and groups. The HGNC website http://www.genenames.org/ is an online repository of HGNC-approved gene nomenclature and associated resources for human genes, and includes links to genomic, proteomic and phenotypic information. In addition to this, we also have dedicated gene family web pages and are currently expanding and generating more of these pages using data curated by the HGNC and from information derived from external resources that focus on particular gene families. Here, we review our current online resources with a particular focus on our gene family data, using it to highlight our new Gene Symbol Report and gene family data downloads.
The identification of orthologs (gene pairs descended from a common ancestor through speciation, rather than duplication) has emerged as an essential component of many bioinformatics applications, ranging from the annotation of new genomes to experimental target prioritization. Yet the development and application of orthology inference methods are hampered by the lack of consensus on source proteomes, file formats and benchmarks. The second ‘Quest for Orthologs’ meeting brought together stakeholders from various communities to address these challenges. We report on achievements and outcomes of this meeting, focusing on topics of particular relevance to the research community at large. The Quest for Orthologs consortium is an open community that welcomes contributions from all researchers interested in orthology research and applications.
Previously, the majority of the human genome was thought to be ‘junk’ DNA with no functional purpose. Over the past decade, the field of RNA research has rapidly expanded, with a concomitant increase in the number of non-protein coding RNA (ncRNA) genes identified in this ‘junk’. Many of the encoded ncRNAs have already been shown to be essential for a variety of vital functions, and this wealth of annotated human ncRNAs requires standardised naming in order to aid effective communication. The HUGO Gene Nomenclature Committee (HGNC) is the only organisation authorised to assign standardised nomenclature to human genes. Of the 30,000 approved gene symbols currently listed in the HGNC database (http://www.genenames.org/search), the majority represent protein-coding genes; however, they also include pseudogenes, phenotypic loci and some genomic features. In recent years the list has also increased to include almost 3,000 named human ncRNA genes. HGNC is actively engaging with the RNA research community in order to provide unique symbols and names for each sequence that encodes an ncRNA. Most of the classical small ncRNA genes have now been provided with a unique nomenclature, and work on naming the long (>200 nucleotides) non-coding RNAs (lncRNAs) is ongoing.
ncRNA; RNA; nomenclature; non-protein coding
The HUGO Gene Nomenclature Committee (HGNC) aims to assign a unique gene symbol and name to every human gene. The HGNC database currently contains almost 30 000 approved gene symbols, over 19 000 of which represent protein-coding genes. The public website, www.genenames.org, displays all approved nomenclature within Symbol Reports that contain data curated by HGNC editors and links to related genomic, phenotypic and proteomic information. Here we describe improvements to our resources, including a new Quick Gene Search, a new List Search, an integrated HGNC BioMart and a new Statistics and Downloads facility.
Short-chain dehydrogenases/reductases (SDR) constitute one of the largest enzyme superfamilies with presently over 46 000 members. In phylogenetic comparisons, members of this superfamily show early divergence where the majority have only low pair-wise sequence identity, although sharing common structural properties. The SDR enzymes are present in virtually all genomes investigated, and in humans over 70 SDR genes have been identified. In humans, these enzymes are involved in the metabolism of a large variety of compounds, including steroid hormones, prostaglandins, retinoids, lipids and xenobiotics. It is now clear that SDRs represent one of the oldest protein families and contribute to essential functions and interactions of all forms of life. As this field continues to grow rapidly, a systematic nomenclature is essential for future annotation and reference purposes. A functional subdivision of the SDR superfamily into at least 200 SDR families based upon hidden Markov models forms a suitable foundation for such a nomenclature system, which we present in this paper using human SDRs as examples.
SDR; enzymes; nomenclature; bioinformatics; hidden Markov models
The first ‘Gene Nomenclature Across Species’ meeting was held on 12th and 13th October 2009, at the Møller Centre in Cambridge, UK. This meeting, organised and hosted by the HUGO Gene Nomenclature Committee (HGNC), brought together invited experts from the fields of gene nomenclature, phylogenetics and genome assembly and annotation. The central aim of the meeting was to discuss the issues of coordinating gene naming across vertebrates, culminating in the publication of recommendations for assigning nomenclature to genes across multiple species.
The expanding number of members in the various human heat shock protein (HSP) families and the inconsistencies in their nomenclature have often led to confusion. Here, we propose new guidelines for the nomenclature of the human HSP families, HSPH (HSP110), HSPC (HSP90), HSPA (HSP70), DNAJ (HSP40), and HSPB (small HSP) as well as for the human chaperonin families HSPD/E (HSP60/HSP10) and CCT (TRiC). The nomenclature is based largely on the more consistent nomenclature assigned by the HUGO Gene Nomenclature Committee and used in the National Center for Biotechnology Information Entrez Gene database for the heat shock genes. In addition to this nomenclature, we provide a list of the human Entrez Gene IDs and the corresponding Entrez Gene IDs for the mouse orthologs.
Nomenclature; Human heat shock proteins
The human X chromosome has a unique biology that was shaped by its evolution as the sex chromosome shared by males and females. We have determined 99.3% of the euchromatic sequence of the X chromosome. Our analysis illustrates the autosomal origin of the mammalian sex chromosomes, the stepwise process that led to the progressive loss of recombination between X and Y, and the extent of subsequent degradation of the Y chromosome. LINE1 repeat elements cover one-third of the X chromosome, with a distribution that is consistent with their proposed role as way stations in the process of X-chromosome inactivation. We found 1,098 genes in the sequence, of which 99 encode proteins expressed in testis and in various tumour types. A disproportionately high number of mendelian diseases are documented for the X chromosome. Of this number, 168 have been explained by mutations in 113 X-linked genes, which in many cases were characterized with the aid of the DNA sequence.
At the FASEB summer research conference on “Arf Family GTPases”, held in Il Ciocco, Italy in June, 2007, it became evident to researchers that our understanding of the family of Arf GTPase activating proteins (ArfGAPs) has grown exponentially in recent years. A common nomenclature for these genes and proteins will facilitate discovery of biological functions and possible connections to pathogenesis. Nearly 100 researchers were contacted to generate a consensus nomenclature for human ArfGAPs. This article describes the resulting consensus nomenclature and provides a brief description of each of the 10 subfamilies of 31 human genes encoding proteins containing the ArfGAP domain.
The HUGO Gene Nomenclature Committee (HGNC) aims to assign a unique and ideally meaningful name and symbol to every human gene. The HGNC database currently comprises over 24 000 public records containing approved human gene nomenclature and associated gene information. Following our recent relocation to the European Bioinformatics Institute our homepage can now be found at http://www.genenames.org, with direct links to the searchable HGNC database and other related database resources, such as the HCOP orthology search tool and manually curated gene family webpages.
The homeobox genes are a large and diverse group of genes, many of which play important roles in the embryonic development of animals. Increasingly, homeobox genes are being compared between genomes in an attempt to understand the evolution of animal development. Despite their importance, the full diversity of human homeobox genes has not previously been described.
We have identified all homeobox genes and pseudogenes in the euchromatic regions of the human genome, finding many unannotated, incorrectly annotated, unnamed, misnamed or misclassified genes and pseudogenes. We describe 300 human homeobox loci, which we divide into 235 probable functional genes and 65 probable pseudogenes. These totals include 3 genes with partial homeoboxes and 13 pseudogenes that lack homeoboxes but are clearly derived from homeobox genes. These figures exclude the repetitive DUX1 to DUX5 homeobox sequences of which we identified 35 probable pseudogenes, with many more expected in heterochromatic regions. Nomenclature is established for approximately 40 formerly unnamed loci, reflecting their evolutionary relationships to other loci in human and other species, and nomenclature revisions are proposed for around 30 other loci. We use a classification that recognizes 11 homeobox gene 'classes' subdivided into 102 homeobox gene 'families'.
We have conducted a comprehensive survey of homeobox genes and pseudogenes in the human genome, described many new loci, and revised the classification and nomenclature of homeobox genes. The classification scheme may be widely applicable to homeobox genes in other animal genomes and will facilitate comparative genomics of this important gene superclass.
The HUGO Gene Nomenclature Committee (HGNC) aims to give every human gene a unique and ideally meaningful name and symbol. The HGNC database, previously known as Genew, contains over 22 000 public records with approved human gene nomenclature and associated information. The database has undergone major improvements throughout the last year, is publicly available for online searching at and has a new custom downloads interface at .
The human genome sequence defines our inherent biological potential; the realization of the biology encoded therein requires knowledge of the function of each gene. Currently, our knowledge in this area is still limited. Several lines of investigation have been used to elucidate the structure and function of the genes in the human genome. Even so, gene prediction remains a difficult task, as the varieties of transcripts of a gene may vary to a great extent. We thus performed an exhaustive integrative characterization of 41,118 full-length cDNAs that capture the gene transcripts as complete functional cassettes, providing an unequivocal report of structural and functional diversity at the gene level. Our international collaboration has validated 21,037 human gene candidates by analysis of high-quality full-length cDNA clones through curation using unified criteria. This led to the identification of 5,155 new gene candidates. It also manifested the most reliable way to control the quality of the cDNA clones. We have developed a human gene database, called the H-Invitational Database (H-InvDB; http://www.h-invitational.jp/). It provides the following: integrative annotation of human genes, description of gene structures, details of novel alternative splicing isoforms, non-protein-coding RNAs, functional domains, subcellular localizations, metabolic pathways, predictions of protein three-dimensional structure, mapping of known single nucleotide polymorphisms (SNPs), identification of polymorphic microsatellite repeats within human genes, and comparative results with mouse full-length cDNAs. The H-InvDB analysis has shown that up to 4% of the human genome sequence (National Center for Biotechnology Information build 34 assembly) may contain misassembled or missing regions. We found that 6.5% of the human gene candidates (1,377 loci) did not have a good protein-coding open reading frame, of which 296 loci are strong candidates for non-protein-coding RNA genes. 
In addition, among 72,027 uniquely mapped SNPs and insertions/deletions localized within human genes, 13,215 nonsynonymous SNPs, 315 nonsense SNPs, and 452 indels occurred in coding regions. Together with 25 polymorphic microsatellite repeats present in coding regions, they may alter protein structure, causing phenotypic effects or resulting in disease. The H-InvDB platform represents a substantial contribution to resources needed for the exploration of human biology and pathology.
An international team has systematically validated and annotated just over 21,000 human genes using full-length cDNA, thereby providing a valuable new resource for the human genetics community.