Xenoturbellida and Acoelomorpha are marine worms with contentious ancestry. Both were originally associated with the flatworms (Platyhelminthes), but molecular data have revised their phylogenetic positions, generally linking Xenoturbellida to the deuterostomes1,2 and positioning the Acoelomorpha as the most basally branching bilaterian group(s)3–6. Recent phylogenomic data suggested that Xenoturbellida and Acoelomorpha are sister taxa and together constitute an early branch of Bilateria7. Here we assemble three independent data sets—mitochondrial genes, a phylogenomic data set of 38,330 amino-acid positions and new microRNA (miRNA) complements—and show that the position of Acoelomorpha is strongly affected by a long-branch attraction (LBA) artefact. When we minimize LBA we find consistent support for a position of both acoelomorphs and Xenoturbella within the deuterostomes. The most likely phylogeny links Xenoturbella and Acoelomorpha in a clade we call Xenacoelomorpha. The Xenacoelomorpha is the sister group of the Ambulacraria (hemichordates and echinoderms). We show that analyses of miRNA complements8 have been affected by character loss in the acoels and that both groups possess one miRNA and the gene Rsb66 otherwise specific to deuterostomes. In addition, Xenoturbella shares one miRNA with the ambulacrarians, and two with the acoels. This phylogeny makes sense of the shared characteristics of Xenoturbellida and Acoelomorpha, such as ciliary ultrastructure and diffuse nervous system, and implies the loss of various deuterostome characters in the Xenacoelomorpha including coelomic cavities, through gut and gill slits.
Inherited retinal degeneration (IRD) is a common cause of visual impairment (prevalence ∼1/3500). There is considerable phenotype and genotype heterogeneity, making a specific diagnosis very difficult without molecular testing. We investigated targeted capture combined with next-generation sequencing using Nimblegen 12plex arrays and the Roche 454 sequencing platform to explore its potential for clinical diagnostics in two common types of IRD, retinitis pigmentosa and cone-rod dystrophy. Fifty patients (36 unknowns and 14 positive controls) were screened, and pathogenic mutations were identified in 25% of patients in the unknown group, with 53% in the early-onset cases. All patients with new mutations detected had an age of onset <21 years and 44% had a family history. Thirty-one percent of mutations detected were novel. A de novo mutation in rhodopsin was identified in one early-onset case without a family history. Bioinformatic pipelines were developed to identify likely pathogenic mutations and stringent criteria were used for assignment of pathogenicity. Analysis of sequencing metrics revealed significant variability in capture efficiency and depth of coverage. We conclude that targeted capture and next-generation sequencing are likely to be very useful in a diagnostic setting, but patients with earlier onset of disease are more likely to benefit from using this strategy. The mutation-detection rate suggests that many patients are likely to have mutations in novel genes.
retinal degeneration; molecular diagnostics; next-generation sequencing
In severe early-onset epilepsy, precise clinical and molecular genetic diagnosis is complex, as many metabolic and electro-physiological processes have been implicated in disease causation. The clinical phenotypes share many features such as complex seizure types and developmental delay. Molecular diagnosis has historically been confined to sequential testing of candidate genes known to be associated with specific sub-phenotypes, but the diagnostic yield of this approach can be low. We conducted whole-genome sequencing (WGS) on six patients with severe early-onset epilepsy who had previously been refractory to molecular diagnosis, and their parents. Four of these patients had a clinical diagnosis of Ohtahara Syndrome (OS) and two patients had severe non-syndromic early-onset epilepsy (NSEOE). In two OS cases, we found de novo non-synonymous mutations in the genes KCNQ2 and SCN2A. In a third OS case, WGS revealed paternal isodisomy for chromosome 9, leading to identification of the causal homozygous missense variant in KCNT1, which produced a substantial increase in potassium channel current. The fourth OS patient had a recessive mutation in PIGQ that led to exon skipping and defective glycophosphatidyl inositol biosynthesis. The two patients with NSEOE had likely pathogenic de novo mutations in CBL and CSNK1G1, respectively. Mutations in these genes were not found among 500 additional individuals with epilepsy. This work reveals two novel genes for OS, KCNT1 and PIGQ. It also uncovers unexpected genetic mechanisms and emphasizes the power of WGS as a clinical tool for making molecular diagnoses, particularly for highly heterogeneous disorders.
Variation at regulatory elements, identified through hypersensitivity to digestion by DNase I, is believed to contribute to variation in complex traits, but the extent and consequences of this variation are poorly characterized. Analysis of terminally differentiated erythroblasts in eight inbred strains of mice identified reproducible variation at approximately 6% of DNase I hypersensitive sites (DHS). Only 30% of such variable DHS contain a sequence variant predictive of site variation. Nevertheless, sequence variants within variable DHS are more likely to be associated with complex traits than those in non-variant DHS, and variants associated with complex traits preferentially occur in variable DHS. Changes at a small proportion (less than 10%) of variable DHS are associated with changes in nearby transcriptional activity. Our results show that whilst DNA sequence variation is not the major determinant of variation in open chromatin, where such variants exist they are likely to be causal for complex traits.
Regulatory sites of the genome affect gene expression and complex traits, including disease susceptibility. Variable regulatory sites are potentially interesting because they are a likely cause of phenotypic variation, providing a bridge between sequence and transcriptional variation. In this paper we identify regions of the genome where DNA is not wrapped up in chromatin (hence potentially regulatory) in eight inbred strains of mice. We identify sites that vary among strains and compare them to non-variable sites. We show that more than half of variable sites cannot be attributed to local sequence variation. Functional consequences (in terms of readily detectable changes in gene expression) are associated with less than 10% of variable DNase I hypersensitive sites. We show that variable sites are enriched for sequence variants contributing to complex traits in mice.
Motivation: The ready availability of next-generation sequencing has led to a situation where it is easy to produce very fragmentary genome assemblies. We present a pipeline, SWiPS (Scaffolding With Protein Sequences), that uses orthologous proteins to improve low quality genome assemblies. The protein sequences are used as guides to scaffold existing contigs, while simultaneously allowing the gene structure to be predicted by homology.
Results: SWiPS does not depend on a high N50 or on whole proteins being encoded on a single contig. We tested our algorithm on simulated next-generation data from Ciona intestinalis, real next-generation data from Drosophila melanogaster, a complex genome assembly of Homo sapiens and the low coverage Sanger sequence assembly of Callorhinchus milii. The improvements in N50 are of the order of ∼20% for the C. intestinalis and H. sapiens assemblies, which is significant, considering the large size of intergenic regions in these eukaryotes. Using the CEGMA pipeline to assess the gene space represented in the genome assemblies, the number of genes retrieved increased by >110% for C. milii and from 20 to 40% for C. intestinalis. The scaffold error rates are low: 85–90% of scaffolds are fully correct, and >95% of local contig joins are correct.
Availability: SWiPS is freely available for download at http://www.well.ox.ac.uk/~yli142/swips.html.
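The N50 statistic quoted in the SWiPS results can be illustrated with a minimal sketch. This is the standard definition of N50 (the length of the sequence at which the running total, taken from longest to shortest, first reaches half of the assembly size) and is not taken from the SWiPS code itself. The function name `n50` and the contig lengths are hypothetical and chosen only for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("at least one sequence length is required")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths before scaffolding.
contigs = [100, 200, 300, 400, 500]
# Hypothetical result of joining 100+200 and 300+400 into two scaffolds.
scaffolds = [300, 700, 500]

before = n50(contigs)
after = n50(scaffolds)
print(before, after, f"{100 * (after - before) / before:.0f}% improvement")
```

In this toy case the total assembly length is unchanged, yet joining contigs raises the N50, which is the kind of improvement the percentages above describe.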
The mammalian epidermis is a continually renewing structure that provides the interface between the organism and an innately hostile environment. The keratinocyte is its principal cell. Keratinocyte proteins form a physical epithelial barrier, protect against microbial damage, and prepare immune responses to danger. Epithelial immunity is disordered in many common diseases and disordered epithelial differentiation underlies many cancers. In order to identify the genes that mediate epithelial development we used a tissue model of the skin derived from primary human keratinocytes. We measured global gene expression in triplicate at five times over the ten days that the keratinocytes took to fully differentiate. We identified 1282 gene transcripts that significantly changed during differentiation (false discovery rate <0.01%). We robustly grouped these transcripts by K-means clustering into modules with distinct temporal expression patterns, shared regulatory motifs, and biological functions. We found a striking cluster of late expressed genes that form the structural and innate immune defences of the epithelial barrier. Gene Ontology analyses showed that undifferentiated keratinocytes were characterised by genes for motility and the adaptive immune response. We systematically identified calcium-binding genes, which may operate with the epidermal calcium gradient to control keratinocyte division during skin repair. The results provide multiple novel insights into keratinocyte biology, in particular providing a comprehensive list of known and previously unrecognised major components of the epidermal barrier. The findings provide a reference for subsequent understanding of how the barrier functions in health and disease.
Comparisons between completely sequenced metazoan genomes have generally emphasized how similar their encoded protein content is, even when the comparison is between phyla. Given the manifest differences between phyla and, in particular, intuitive notions that some animals are more complex than others, this creates something of a paradox. Simplistic explanations have included arguments such as increased numbers of genes; greater numbers of protein products produced through alternative splicing; increased numbers of regulatory non-coding RNAs and increased complexity of the cis-regulatory code. An obvious value of complete genome sequences lies in their ability to provide us with inventories of such components. I examine progress being made in linking genome content to the pattern of animal evolution, and argue that the gap between genomic and phenotypic complexity can only be understood through the totality of interacting components.
comparative genomics; evolution; Metazoa; transcription factors; ultraconserved regions
Hypoxia-inducible factor (HIF) controls an extensive range of adaptive responses to hypoxia. To better understand this transcriptional cascade we performed genome-wide chromatin immunoprecipitation using antibodies to two major HIF-α subunits, and correlated the results with genome-wide transcript profiling. Within a tiled promoter array we identified 546 and 143 sequences that bound, respectively, to HIF-1α or HIF-2α at high stringency. Analysis of these sequences confirmed an identical core binding motif for HIF-1α and HIF-2α (RCGTG) but demonstrated that binding to this motif was highly selective, with binding enriched at distinct regions both upstream and downstream of the transcriptional start. Comparison of HIF-promoter binding data with bidirectional HIF-dependent changes in transcript expression indicated that whereas a substantial proportion of positive responses (>20% across all significantly regulated genes) are direct, HIF-dependent gene suppression is almost entirely indirect. Comparison of HIF-1α- versus HIF-2α-binding sites revealed that whereas some loci bound HIF-1α in isolation, many bound both isoforms with similar affinity. Despite high-affinity binding to multiple promoters, HIF-2α contributed to few, if any, of the transcriptional responses to acute hypoxia at these loci. Given emerging evidence for biologically distinct functions of HIF-1α versus HIF-2α, understanding the mechanisms restricting HIF-2α activity will be of interest.
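The core motif RCGTG uses the IUPAC code R for a purine (A or G). The following is a minimal sketch of how candidate occurrences of this motif could be located on both strands of a DNA sequence. It is a plain sequence scan and not the ChIP-based analysis described in the abstract. The function names and the example sequence are hypothetical.

```python
import re

# R (purine) expands to A or G; the lookahead lets overlapping hits be reported.
CORE_MOTIF = re.compile(r"(?=([AG]CGTG))")
MOTIF_LENGTH = 5
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_core_motifs(seq):
    """Return sorted (start, strand, match) tuples for RCGTG on both strands.

    `start` is the 0-based forward-strand coordinate of the leftmost base
    covered by the match, whichever strand the motif lies on.
    """
    seq = seq.upper()
    hits = [(m.start(), "+", m.group(1)) for m in CORE_MOTIF.finditer(seq)]
    for m in CORE_MOTIF.finditer(reverse_complement(seq)):
        # Map the reverse-strand position back to forward-strand coordinates.
        start = len(seq) - (m.start() + MOTIF_LENGTH)
        hits.append((start, "-", m.group(1)))
    return sorted(hits)


# Hypothetical 16-base sequence with one motif on each strand.
print(find_core_motifs("TTACGTGAACACGCTT"))
```

As the abstract notes, binding to this motif is highly selective, so a scan like this yields candidate sites only and says nothing about which of them are actually bound.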
Summary: POPE (Phylogeny, Ortholog and Paralog Extractor) provides an integrated platform for automatic ortholog identification. Intermediate steps can be visualized, modified and analyzed in order to assess and improve the underlying quality of orthology and paralogy assignments.
Availability: POPE is available for download from the website: http://www.well.ox.ac.uk/~tota/pope.
Alternative splicing of genes is an efficient means of generating variation in protein function. Several disease states have been associated with rare genetic variants that affect splicing patterns. Conversely, splicing efficiency of some genes is known to vary between individuals without apparent ill effects. What is not clear is whether commonly observed phenotypic variation in splicing patterns, and hence potential variation in protein function, is to a significant extent determined by naturally occurring DNA sequence variation and in particular by single nucleotide polymorphisms (SNPs). In this study, we surveyed the splicing patterns of 250 exons in 22 individuals who had been previously genotyped by the International HapMap Project. We identified 70 simple cassette exon alternative splicing events in our experimental system; for six of these, we detected consistent differences in splicing pattern between individuals, with a highly significant association between splice phenotype and neighbouring SNPs. Remarkably, for five out of six of these events, the strongest correlation was found with the SNP closest to the intron–exon boundary, although the distance between these SNPs and the intron–exon boundary ranged from 2 bp to greater than 1,000 bp. Two of these SNPs were further investigated using a minigene splicing system, and in each case the SNPs were found to exert cis-acting effects on exon splicing efficiency in vitro. The functional consequences of these SNPs could not be predicted using bioinformatic algorithms. Our findings suggest that phenotypic variation in splicing patterns is determined by the presence of SNPs within flanking introns or exons. Effects on splicing may represent an important mechanism by which SNPs influence gene function.
Genetic variation, through its effects on gene expression, influences many aspects of the human phenotype. Understanding the impact of genetic variation on human disease risk has become a major goal for biomedical research and has the potential of revealing both novel disease mechanisms and novel functional elements controlling gene expression. Recent large-scale studies have suggested that a relatively high proportion of human genes show allele-specific variation in expression. Effects of common DNA polymorphisms on mRNA splicing are less well studied. Variation in splicing patterns is known to be tissue specific, and for a small number of genes has been shown to vary among individuals. What is not known is whether allele-specific splicing events are an important mechanism by which common genetic variation affects gene expression. In this study we show that allele-specific alternative splicing was observed in six out of 70 exon-skipping events. Sequence analysis of the relevant splice sites and of the regions surrounding single nucleotide polymorphisms correlated with the splicing events failed to identify any predictive bioinformatic signals. A genome-wide study of allele-specific splicing, using an experimental rather than a bioinformatic approach, is now required.
InterPro is an integrated resource for protein families, domains and functional sites, which integrates the following protein signature databases: PROSITE, PRINTS, ProDom, Pfam, SMART, TIGRFAMs, PIRSF, SUPERFAMILY, Gene3D and PANTHER. The latter two new member databases have been integrated since the last publication in this journal. There have been several new developments in InterPro, including an additional reading field, new database links, extensions to the web interface and additional match XML files. InterPro has always provided matches to UniProtKB proteins on the website and in the match XML file on the FTP site. Additional matches to proteins in UniParc (UniProt archive) are now available for download in the new match XML files only. The latest InterPro release (13.0) contains more than 13 000 entries, covering over 78% of all proteins in UniProtKB. The database is available for text- and sequence-based searches via a webserver, and for download by anonymous FTP. The InterProScan search tool is now also available via a web service.
High-resolution genetic maps are required for mapping complex traits and for the study of recombination. We report the highest density genetic map yet created for any organism, except humans. Using more than 10,000 single nucleotide polymorphisms evenly spaced across the mouse genome, we have constructed genetic maps for both outbred and inbred mice, and separately for males and females. Recombination rates are highly correlated in outbred and inbred mice, but show relatively low correlation between males and females. Differences between male and female recombination maps and the sequence features associated with recombination are strikingly similar to those observed in humans. Genetic maps are available from http://gscan.well.ox.ac.uk/#genetic_map and as supporting information to this publication.
A high-density SNP map based on outbred and inbred mice with male and female separation suggests a high degree of homology between mouse and human recombination.
The Simple Modular Architecture Research Tool (SMART) is an online resource used for protein domain identification and the analysis of protein domain architectures. Many new features were implemented to make SMART more accessible to scientists from different fields. The new ‘Genomic’ mode in SMART makes it easy to analyze domain architectures in completely sequenced genomes. Domain annotation has been updated with a detailed taxonomic breakdown and a prediction of the catalytic activity for 50 SMART domains is now available, based on the presence of essential amino acids. Furthermore, intrinsically disordered protein regions can be identified and displayed. The network context is now displayed in the results page for more than 350 000 proteins, enabling easy analyses of domain interactions.
The Engrailed Homology 1 (EH1) motif is a small region, believed to have evolved convergently in homeobox and forkhead containing proteins, that interacts with the Drosophila protein groucho (C. elegans unc-37, Human Transducin-like Enhancers of Split). The small size of the motif makes its reliable identification by computational means difficult. I have systematically searched the predicted proteomes of Drosophila, C. elegans and human for further instances of the motif.
Using motif identification methods and database searching techniques, I delimit which homeobox and forkhead domain containing proteins also have likely EH1 motifs. I show that despite low database search scores, there is a significant association of the motif with transcription factor function. I further show that likely EH1 motifs are found in combination with T-Box, Zinc Finger and Doublesex domains as well as discussing other plausible candidate associations. I identify strong candidate EH1 motifs in basal metazoan phyla.
Candidate EH1 motifs exist in combination with a variety of transcription factor domains, suggesting that these proteins have repressor functions. The distribution of the EH1 motif is suggestive of convergent evolution, although in many cases, the motif has been conserved throughout bilaterian orthologs. Groucho-mediated repression was established prior to the evolution of Bilateria.
The functional sites of a protein present important information for determining its cellular function and are fundamental in drug design. Accordingly, accurate methods for the prediction of functional sites are of immense value. Most available methods are based on a set of homologous sequences and structural or evolutionary information, and assume that functional sites are more conserved than the average. In the analysis presented here, we have investigated the conservation of location and type of amino acids at functional sites, and compared the behaviour of functional sites between different protein domains.
Functional sites were extracted from experimentally determined structural complexes from the Protein Data Bank harbouring a conserved protein domain from the SMART database. In general, functional (i.e. interacting) sites whose location is more highly conserved are also more conserved in their type of amino acid. However, even highly conserved functional sites can present a wide spectrum of amino acids. The degree of conservation strongly depends on the function of the protein domain and ranges from highly conserved in location and amino acid to very variable. Differentiation by binding partner shows that ion binding sites tend to be more conserved than functional sites binding peptides or nucleotides.
The results gained by this analysis will help improve the accuracy of functional site prediction and facilitate the characterization of unknown protein sequences.
SMART (Simple Modular Architecture Research Tool) is a web tool (http://smart.embl.de/) for the identification and annotation of protein domains, and provides a platform for the comparative study of complex domain architectures in genes and proteins. The January 2004 release of SMART contains 685 protein domains. New developments in SMART are centred on the integration of data from completed metazoan genomes. SMART now uses predicted proteins from complete genomes in its source sequence databases, and integrates these with predictions of orthology. New visualization tools have been developed to allow analysis of gene intron–exon structure within the context of protein domain structure, and to align these displays to provide schematic comparisons of orthologous genes, or multiple transcripts from the same gene. Other improvements include the ability to query SMART by Gene Ontology terms, improved structure database searching and batch retrieval of multiple entries.
InterPro, an integrated documentation resource of protein families, domains and functional sites, was created in 1999 as a means of amalgamating the major protein signature databases into one comprehensive resource. PROSITE, Pfam, PRINTS, ProDom, SMART and TIGRFAMs have been manually integrated and curated and are available in InterPro for text- and sequence-based searching. The results are provided in a single format that rationalises the results that would be obtained by searching the member databases individually. The latest release of InterPro contains 5629 entries describing 4280 families, 1239 domains, 95 repeats and 15 post-translational modifications. Currently, the combined signatures in InterPro cover more than 74% of all proteins in SWISS-PROT and TrEMBL, an increase of nearly 15% since the inception of InterPro. New features of the database include improved searching capabilities and enhanced graphical user interfaces for visualisation of the data. The database is available via a webserver (http://www.ebi.ac.uk/interpro) and anonymous FTP (ftp://ftp.ebi.ac.uk/pub/databases/interpro).
SMART (Simple Modular Architecture Research Tool, http://smart.embl-heidelberg.de) is a web-based resource used for the annotation of protein domains and the analysis of domain architectures, with particular emphasis on mobile eukaryotic domains. Extensive annotation for each domain family is available, providing information relating to function, subcellular localization, phyletic distribution and tertiary structure. The January 2002 release has added more than 200 hand-curated domain models. This brings the total to over 600 domain families that are widely represented among nuclear, signalling and extracellular proteins. Annotation now includes links to the Online Mendelian Inheritance in Man (OMIM) database in cases where a human disease is associated with one or more mutations in a particular domain. We have implemented new analysis methods and updated others. New advanced queries provide direct access to the SMART relational database using SQL. This database now contains information on intrinsic sequence features such as transmembrane regions, coiled-coils, signal peptides and internal repeats. SMART output can now be easily included in users’ documents. A SMART mirror has been created at http://smart.ox.ac.uk.
SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures (http://SMART.embl-heidelberg.de). More than 400 domain families found in signalling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database as well as search parameters and taxonomic information are stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa.