Comparative genomics combined with phylogenetic reconstructions are powerful approaches to study the evolution of genes and genomes. However, the current rapid expansion of the volume of genomic information makes it increasingly difficult to interrogate, integrate and synthesize comparative genome data while taking into account the maximum breadth of information available. GenomicusPlants (http://www.genomicus.biologie.ens.fr/genomicus-plants) is an extension of the Genomicus webserver that addresses this issue by allowing users to explore flowering plant genomes in an intuitive way, across the broadest evolutionary scales. Extant genomes of 26 flowering plants can be analyzed, as well as 23 ancestral reconstructed genomes. Ancestral gene order provides a long-term chronological view of gene order evolution, greatly facilitating comparative genomics and evolutionary studies. Four main interfaces (‘views’) are available where: (i) PhyloView combines phylogenetic trees with comparisons of genomic loci across any number of genomes; (ii) AlignView projects loci of interest against all other genomes to visualize their topological conservation; (iii) MatrixView compares two genomes in a classical dotplot representation; and (iv) Karyoview visualizes chromosome karyotypes ‘painted’ with colours of another genome of interest. All four views are interconnected and benefit from many customizable features.
Ancestral reconstruction; Evolution; Flowering plants; Genomics; Synteny
Genomicus (http://www.dyogen.ens.fr/genomicus/) is a database and an online tool that allows easy comparative genomic visualization in >150 eukaryote genomes. It provides a way to explore spatial information related to gene organization within and between genomes and temporal relationships related to gene and genome evolution. For the specific vertebrate phylum, it also provides access to ancestral gene order reconstructions and conserved non-coding elements information. We extended the Genomicus database, originally dedicated to vertebrates, to four new clades, including plants, non-vertebrate metazoa, protists and fungi. This visualization tool allows evolutionary phylogenomics analysis and exploration. Here, we describe the graphical modules of Genomicus and show how it can reveal differential gene loss and gain and segmental or genome duplications, and how it can be used to study the evolution of a locus through homology relationships.
Genome Browsers are software that allow the user to view genome annotations in the context of a reference sequence, such as a chromosome, contig, scaffold, etc. The Generic Genome Browser (GBrowse) is an open source genome browser package developed as part of the Generic Model Database Project (see Unit 9.9; Stein et al., 2002). The increasing number of sequenced genomes has led to a corresponding growth in the field of comparative genomics, which requires methods to view and compare multiple genomes. Using the same software framework as GBrowse, the Generic Synteny Browser (GBrowse_syn) allows the comparison of co-linear regions of multiple genomes using the familiar GBrowse-style web page. Like GBrowse, GBrowse_syn can be configured to display any organism and is currently the synteny browser used for model organisms such as C. elegans (WormBase; www.wormbase.org; see Unit 1.8) and Arabidopsis (TAIR; www.arabidopsis.org; see Unit 1.11). GBrowse_syn is part of the GBrowse software package and can be downloaded from the web and run on any unix-like operating system, such as Linux, Solaris, Mac OS X, etc. GBrowse_syn is still under active development. This unit will cover installation and configuration as part of the current stable version of GBrowse (v1.71).
The recent availability of an expanding collection of genome sequences driven by technological advances has facilitated comparative genomics and in particular the identification of synteny among multiple genomes. However, the development of effective and easy-to-use methods for identifying such conserved gene clusters among multiple genomes–synteny blocks–as well as databases, which host synteny blocks from various groups of species (especially eukaryotes) and also allow users to run synteny-identification programs, lags behind.
OrthoClusterDB is a new online platform for the identification and visualization of synteny blocks. OrthoClusterDB consists of two key web pages: Run OrthoCluster and View Synteny. The Run OrthoCluster page serves as web front-end to OrthoCluster, a recently developed program for synteny block detection. Run OrthoCluster offers full control over the functionalities of OrthoCluster, such as specifying synteny block size, considering order and strandedness of genes within synteny blocks, including or excluding nested synteny blocks, handling one-to-many orthologous relationships, and comparing multiple genomes. In contrast, the View Synteny page gives access to perfect and imperfect synteny blocks precomputed for a large number of genomes, without the need for users to retrieve and format input data. Additionally, genes are cross-linked with public databases for effective browsing. For both Run OrthoCluster and View Synteny, identified synteny blocks can be browsed at the whole genome, chromosome, and individual gene level. OrthoClusterDB is freely accessible.
We have developed an online system for the identification and visualization of synteny blocks among multiple genomes. The system is freely available at .
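The notion of a perfect synteny block, with the order and strandedness options mentioned above, can be illustrated with a minimal Python sketch. This is not the OrthoCluster algorithm: the function names, the input layout (lists of `(gene_id, strand)` tuples plus a one-to-one ortholog dictionary) and the rule that any intervening gene breaks a block are all assumptions made for illustration. One-to-many orthologs, nested blocks and imperfect blocks are not handled.

```python
from typing import Dict, List, Tuple

Gene = Tuple[str, int]          # (gene id, strand as +1 or -1); hypothetical layout
Anchor = Tuple[int, int, int]   # (index in A, index in B, relative strand)


def _extends(block: List[Anchor], anchor: Anchor, use_strand: bool) -> bool:
    """Return True if `anchor` continues `block` without any gap."""
    (prev_i, prev_j, prev_s), (i, j, s) = block[-1], anchor
    step = j - prev_j
    # Genes must be adjacent in both genomes (forward or reversed order).
    if i != prev_i + 1 or step not in (1, -1):
        return False
    # The direction of travel in genome B must stay constant within a block.
    if len(block) > 1 and step != block[1][1] - block[0][1]:
        return False
    # Forward blocks need matching strands, reversed blocks need flipped ones.
    return not use_strand or (prev_s == step and s == step)


def perfect_blocks(genome_a: List[Gene], genome_b: List[Gene],
                   orthologs: Dict[str, str], min_size: int = 2,
                   use_strand: bool = True) -> List[List[str]]:
    """List maximal gap-free runs of orthologs, as gene ids of genome A."""
    pos_b = {gene: (j, strand) for j, (gene, strand) in enumerate(genome_b)}
    anchors: List[Anchor] = []
    for i, (gene, strand) in enumerate(genome_a):
        partner = orthologs.get(gene)
        if partner in pos_b:
            j, strand_b = pos_b[partner]
            anchors.append((i, j, strand * strand_b))

    blocks: List[List[Anchor]] = []
    current: List[Anchor] = []
    for anchor in anchors:
        if current and _extends(current, anchor, use_strand):
            current.append(anchor)
        else:
            if len(current) >= min_size:
                blocks.append(current)
            current = [anchor]
    if len(current) >= min_size:
        blocks.append(current)
    return [[genome_a[i][0] for i, _, _ in block] for block in blocks]


if __name__ == "__main__":
    a = [("a1", 1), ("a2", 1), ("a3", -1), ("a4", 1), ("a5", 1)]
    b = [("b3", 1), ("b2", -1), ("b1", -1), ("b9", 1), ("b4", 1), ("b5", 1)]
    pairs = {f"a{k}": f"b{k}" for k in range(1, 6)}
    print(perfect_blocks(a, b, pairs))
```

In this toy input the first three genes form an inverted block and the last two a forward block; setting `use_strand=False` relaxes the orientation requirement, and `min_size` plays the role of a minimum block size.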
The UCSC Cancer Genomics Browser (https://genome-cancer.ucsc.edu) comprises a suite of web-based tools to integrate, visualize and analyze cancer genomics and clinical data. The browser displays whole-genome views of genome-wide experimental measurements for multiple samples alongside their associated clinical information. Multiple data sets can be viewed simultaneously as coordinated ‘heatmap tracks’ to compare across studies or different data modalities. Users can order, filter, aggregate, classify and display data interactively based on any given feature set including clinical features, annotated biological pathways and user-contributed collections of genes. Integrated standard statistical tools provide dynamic quantitative analysis within all available data sets. The browser hosts a growing body of publicly available cancer genomics data from a variety of cancer types, including data generated from the Cancer Genome Atlas project. Multiple consortiums use the browser on confidential prepublication data enabled by private installations. Many new features have been added, including the hgMicroscope tumor image viewer, hgSignature for real-time genomic signature evaluation on any browser track, and ‘PARADIGM’ pathway tracks to display integrative pathway activities. The browser is integrated with the UCSC Genome Browser; thus inheriting and integrating the Genome Browser’s rich set of human biology and genetics data that enhances the interpretability of the cancer genomics data.
Whole-genome comparisons are highly informative regarding genome evolution and can reveal the conservation of genome organization and gene content, gene regulatory elements, and presence of species-specific genes. Initial comparative genome analyses of the human malaria parasite Plasmodium falciparum and rodent malaria parasites (RMPs) revealed a core set of 4,500 Plasmodium orthologs located in the highly syntenic central regions of the chromosomes that sharply defined the boundaries of the variable subtelomeric regions. We used composite RMP contigs, based on partial DNA sequences of three RMPs, to generate a whole-genome synteny map of P. falciparum and the RMPs. The core regions of the 14 chromosomes of P. falciparum and the RMPs are organized in 36 synteny blocks, representing groups of genes that have been stably inherited since these malaria species diverged, but whose relative organization has altered as a result of a predicted minimum of 15 recombination events. P. falciparum-specific genes and gene families are found in the variable subtelomeric regions (575 genes), at synteny breakpoints (42 genes), and as intrasyntenic indels (126 genes). Of the 168 non-subtelomeric P. falciparum genes, including two newly discovered gene families, 68% are predicted to be exported to the surface of the blood stage parasite or infected erythrocyte. Chromosomal rearrangements are implicated in the generation and dispersal of P. falciparum-specific gene families, including one encoding receptor-associated protein kinases. The data show that both synteny breakpoints and intrasyntenic indels can be foci for species-specific genes with a predicted role in host-parasite interactions and suggest that, besides rearrangements in the subtelomeric regions, chromosomal rearrangements may also be involved in the generation of species-specific gene families. 
A majority of these genes are expressed in blood stages, suggesting that the vertebrate host exerts a greater selective pressure than the mosquito vector, resulting in the acquisition of diversity.
Malaria, caused by the parasite Plasmodium falciparum, is one of the most devastating infectious diseases. Rodent malaria parasites (RMPs), such as P. berghei, P. chabaudi, and P. yoelii, are used as models for P. falciparum. For the use of these models in studies of human disease, insight into both the similarities and differences in the genomics and biology of these parasites is important. The availability of significant but partial genome data of the RMPs enabled the construction of a virtual composite RMP genome and its comparison with the P. falciparum genome, generating a so-called synteny map. Analysis of this map provided the desired comparative insights. A high level of conservation exists between roughly 85% of the genes at the level of content and order, but 168 P. falciparum-specific genes that disrupted the conserved genome segments were identified. The majority of these genes were predicted to play a role in host–parasite interactions. This study indicates that determination of the synteny breakpoints may help to rapidly identify the species-specific gene content of future Plasmodium genomes, providing the malaria research community with a powerful investigative tool. The findings may also be of interest to those studying chromosomal evolution.
The analysis of genome synteny is a common practice in comparative genomics. With the advent of DNA sequencing technologies, individual biologists can rapidly produce their genomic sequences of interest. Although web-based synteny visualization tools are convenient for biologists to use, none of the existing ones allow biologists to upload their own data for analysis.
We have developed the web-based Genome Synteny Viewer (GSV) that allows users to upload two data files for synteny visualization, the mandatory synteny file for specifying genomic positions of conserved regions and the optional genome annotation file. GSV presents two selected genomes in a single integrated view while still retaining the browsing flexibility necessary for exploring individual genomes. Users can browse and filter for genomic regions of interest, change the color or shape of each annotation track as well as re-order, hide or show the tracks dynamically. Additional features include downloadable images, immediate email notification and tracking of usage history. The entire GSV package is also lightweight, which enables easy local installation.
GSV provides a unique option for biologists to analyze genome synteny by uploading their own data set to a web-based comparative genome browser. A web server hosting GSV is provided at http://cas-bioinfo.cas.unt.edu/gsv, and the software is also freely available for local installations.
It has been repeatedly observed that gene order is rapidly lost in prokaryotic genomes. However, persistent synteny blocks are found when comparing more or less distant species. These genes that remain consistently adjacent are appealing candidates for the study of genome evolution and a more accurate definition of their functional role. Such studies require visualizing conserved synteny blocks in a large number of genomes at all taxonomic distances.
After comparing nearly 600 completely sequenced genomes encompassing the whole prokaryotic tree of life, the computed synteny data were assembled in a relational database, SynteBase. SynteView was designed to visualize conserved synteny blocks in a large number of genomes after choosing one of them as a reference. SynteView functions with data stored either in SynteBase or in a home-made relational database of personal data. In addition, this software can compute on-the-fly and display the distribution of synteny blocks which are conserved in pairs of genomes. This tool has been designed to provide a wealth of information on each positional orthologous gene, to be user-friendly and customizable. It is also possible to download sequences of genes belonging to these synteny blocks for further studies. SynteView is accessible through Java Webstart at .
SynteBase answers queries about gene order conservation and SynteView visualizes the obtained results in a flexible and powerful way which provides a comparative overview of the conserved synteny in a large number of genomes, whatever their taxonomic distances.
Physical maps are important tools to uncover general chromosome structure as well as to compare different plant lineages and species, helping to elucidate genome structure, evolution and possibilities regarding synteny and colinearity. The increasing production of sequence data has opened an opportunity to link information from mapping studies to the underlying sequences. Genome browsers are invaluable platforms that provide access to these sequences, including tools for genome analysis, allowing the integration of multivariate information, and thus aiding to explain the emergence of complex genomes. The present work presents a tutorial regarding the use of genome browsers to develop targeted physical mapping, providing also a general overview and examples about the possibilities regarding the use of Fluorescent In Situ Hybridization (FISH) using bacterial artificial chromosomes (BAC), simple sequence repeats (SSR) and rDNA probes, highlighting the potential of such studies for map integration and comparative genetics. As a case study, the available genome of soybean was accessed to show how the physical and in silico distribution of such sequences may be compared at different levels. Such evaluations may also be complemented by the identification of sequences beyond the detection level of cytological methods, here using members of the aquaporin gene family as an example. The proposed approach highlights the complementation power of the combination of molecular cytogenetics and computational approaches for the anchoring of coding or repetitive sequences in plant genomes using available genome browsers, helping in the determination of sequence location, arrangement and number of repeats, and also filling gaps found in computational pseudochromosome assemblies.
gene families; FISH; BAC; SSR; aquaporin; bioinformatics
Due to the lack of availability of large genomic sequences for peach or other Prunus species, the degree of synteny conservation between the Prunus species and Arabidopsis has not been systematically assessed. Using the recently available peach EST sequences that are anchored to Prunus genetic maps and to the peach physical map, we analyzed the extent of conserved synteny between the Prunus and the Arabidopsis genomes. The reconstructed pseudo-ancestral Arabidopsis genome, which existed prior to the proposed recent polyploidy event, was also utilized in our analysis to further elucidate the evolutionary relationship.
We analyzed the synteny conservation between the Prunus and the Arabidopsis genomes by comparing 475 peach ESTs that are anchored to Prunus genetic maps and their Arabidopsis homologs detected by sequence similarity. Microsyntenic regions were detected between all five Arabidopsis chromosomes and seven of the eight linkage groups of the Prunus reference map. An additional 1097 peach ESTs that are anchored to 431 BAC contigs of the peach physical map and their Arabidopsis homologs were also analyzed. Microsyntenic regions were detected in 77 BAC contigs. The syntenic regions from both data sets were short and contained only a couple of conserved gene pairs. The synteny between peach and Arabidopsis was fragmentary; all the Prunus linkage groups containing syntenic regions matched to more than two different Arabidopsis chromosomes, and most BAC contigs with multiple conserved syntenic regions corresponded to multiple Arabidopsis chromosomes. Using the same peach EST datasets and their Arabidopsis homologs, we also detected conserved syntenic regions in the pseudo-ancestral Arabidopsis genome. In many cases, the gene order and content of peach regions were more conserved in the ancestral genome than in the present Arabidopsis region. Statistical significance of each syntenic group was calculated using a simulated Arabidopsis genome.
We report here the result of the first extensive analysis of the conserved microsynteny using DNA sequences across the Prunus genome and their Arabidopsis homologs. Our study also illustrates that both the ancestral and present Arabidopsis genomes can provide a useful resource for marker saturation and candidate gene search, as well as elucidating evolutionary relationships between species.
Fragaria vesca, a diploid strawberry species commonly known as the alpine or woodland strawberry, is a versatile experimental plant system and an emerging model for the Rosaceae family. An ancestral F. vesca genome contributed to the genome of the octoploid dessert strawberry (F. ×ananassa), and the extant genome exhibits synteny with other commercially important members of the Rosaceae family such as apple and peach. To provide a molecular description of floral organ and fruit development at the resolution of specific tissues and cell types, RNAs from flowers and early developmental stage fruit tissues of the inbred F. vesca line YW5AF7 were extracted and the resulting cDNA libraries sequenced using an Illumina HiSeq2000. To enable easy access as well as mining of this two-dimensional (stage and tissue) transcriptome dataset, a web-based database, the Strawberry Genomic Resource (SGR), was developed.
SGR is a web-accessible database that contains sample description, sample statistics, gene annotation, and gene expression analysis. This information can be accessed publicly from a web-based interface at http://bioinformatics.towson.edu/strawberry/Default.aspx. The SGR website provides user-friendly search and browse capabilities for all the data stored in the database. Users are able to search for genes using a gene ID or description or obtain differentially expressed genes by entering different comparison parameters. Search results can be downloaded in a tabular format compatible with Microsoft Excel. Reads aligned to individual genes and exon/intron structures are displayed using the genome browser, facilitating gene re-annotation by individual users.
The SGR database was developed to facilitate dissemination and data mining of extensive floral and fruit transcriptome data in the woodland strawberry. It enables users to mine the data in different ways to study different pathways or biological processes during reproductive development.
Strawberry; Transcriptome; RNA-seq; Database; gBrowse; Fruit; Flowers; Rosaceae
Extant genomes share regions where genes have the same order and orientation, which are thought to arise from the conservation of an ancestral order of genes during evolution. Such regions of so-called conserved synteny, or synteny blocks, must be precisely identified and quantified, as a prerequisite to better understand the evolutionary history of genomes.
Here we describe PhylDiag, a software that identifies statistically significant synteny blocks in pairwise comparisons of eukaryote genomes. Compared to previous methods, PhylDiag uses gene trees to define gene homologies, thus allowing gene deletions to be considered as events that may break the synteny. PhylDiag also accounts for gene orientations, blocks of tandem duplicates and lineage specific de novo gene births. Starting from two genomes and the corresponding gene trees, PhylDiag returns synteny blocks with gaps less than or equal to the maximum gap parameter gapmax. This parameter is theoretically estimated, and together with a utility to graphically display results, contributes to making PhylDiag a user friendly method. In addition, putative synteny blocks are subject to a statistical validation to verify that they are unlikely to be due to a random combination of genes.
We benchmark several known metrics to measure 2D-distances in a matrix of homologies and we compare PhylDiag to i-ADHoRe 3.0 on real and simulated data. We show that PhylDiag correctly identifies small synteny blocks even with insertions, deletions, incorrect annotations or micro-inversions. Finally, PhylDiag allowed us to identify the most relevant distance metric for 2D-distance calculation between homologies.
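The role of the maximum gap parameter and of the 2D-distance metric can be illustrated with a minimal Python sketch. This is not PhylDiag's implementation: the function names, the greedy chaining strategy, the convention that the gap equals the distance minus one, and the default Chebyshev metric are assumptions made for illustration. Gene orientation, tandem duplicates and the statistical validation step are ignored.

```python
from typing import Callable, List, Tuple

Point = Tuple[int, int]  # (gene index in genome A, gene index in genome B)


def chebyshev(p: Point, q: Point) -> int:
    """Largest coordinate difference between two homologies."""
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))


def manhattan(p: Point, q: Point) -> int:
    """Sum of coordinate differences between two homologies."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def chain_diagonals(homologies: List[Point], gap_max: int,
                    distance: Callable[[Point, Point], int] = chebyshev,
                    min_size: int = 2) -> List[List[Point]]:
    """Greedily chain homologies into diagonals with gaps <= gap_max."""
    blocks: List[List[Point]] = []
    directions: List[int] = []  # +1 slope, -1 anti-slope, 0 not yet known
    for p in sorted(homologies):
        best, best_d = None, None
        for k, block in enumerate(blocks):
            q = block[-1]
            da, db = p[0] - q[0], p[1] - q[1]
            if da <= 0 or db == 0:
                continue  # must advance in A and move in B
            step = 1 if db > 0 else -1
            if directions[k] not in (0, step):
                continue  # would change the diagonal's direction
            d = distance(q, p)
            # Assumed convention: adjacent homologies have distance 1, gap 0.
            if d - 1 <= gap_max and (best is None or d < best_d):
                best, best_d = k, d
        if best is None:
            blocks.append([p])
            directions.append(0)
        else:
            directions[best] = 1 if p[1] > blocks[best][-1][1] else -1
            blocks[best].append(p)
    return [block for block in blocks if len(block) >= min_size]


if __name__ == "__main__":
    points = [(0, 0), (1, 1), (3, 2), (10, 20), (11, 19), (12, 18), (30, 5)]
    for gap in (0, 1):
        print(gap, chain_diagonals(points, gap_max=gap))
```

With `gap_max=0` the homology at `(3, 2)` is excluded from the first diagonal, while `gap_max=1` tolerates the single skipped gene. Passing `distance=manhattan` instead changes which gaps are accepted, which is the kind of difference a benchmark of distance metrics would probe.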
Comparative genomics; Synteny; Synteny block; Segmental homologies; Homology; Gene order; Rearrangement; Ancestral genome; Gene tree
The recent availability of genomic sequences and BAC libraries for a large number of mammals provides an excellent opportunity for identifying comparatively-anchored markers that are useful for creating high-resolution radiation-hybrid (RH) and BAC-based comparative maps. To use these maps for multispecies genome comparison and evolutionary inference, robust bioinformatic tools are required for the identification of chromosomal regions shared between genomes and to localize the positions of evolutionary breakpoints that are the signatures of chromosomal rearrangements. Here we report an automated tool for the identification of homologous synteny blocks (HSBs) between genomes that tolerates errors common in RH comparative maps and can be used for automated whole-genome analysis of chromosome rearrangements that occur during evolution.
We developed an algorithm and software tool (SyntenyTracker) that can be used for automated definition of HSBs using pair-wise RH or gene-based comparative maps as input. To verify correct implementation of the underlying algorithm, SyntenyTracker was used to identify HSBs in the cattle and human genomes. Results demonstrated 96% agreement with HSBs defined manually using the same set of rules. A comparison of SyntenyTracker with the AutoGRAPH synteny tool was performed using identical datasets containing 14,380 genes with 1:1 orthology in human and mouse. Discrepancies between the results using the two tools and advantages of SyntenyTracker are reported.
SyntenyTracker was shown to be an efficient and accurate automated tool for defining HSBs using datasets that may contain minor errors resulting from limitations in map construction methodologies. The utility of SyntenyTracker will become more important for comparative genomics as the number of mapped and sequenced genomes increases.
Genome browsers are a common tool used by biologists to visualize genomic features including genes, polymorphisms, and many others. However, existing genome browsers and visualization tools are not well-suited to perform meaningful comparative analysis among a large number of genomes. With the increasing quantity and availability of genomic data, there is an increased burden to provide useful visualization and analysis tools for comparison of multiple collinear genomes such as the large panels of model organisms which are the basis for much of the current genetic research.
We have developed a novel web-based tool for visualizing and analyzing multiple collinear genomes. Our tool illustrates genome-sequence similarity through a mosaic of intervals representing local phylogeny, subspecific origin, and haplotype identity. Comparative analysis is facilitated through reordering and clustering of tracks, which can vary throughout the genome. In addition, we provide local phylogenetic trees as an alternate visualization to assess local variations.
Unlike previous genome browsers and viewers, ours allows for simultaneous and comparative analysis. Our browser provides intuitive selection and interactive navigation about features of interest. Dynamic visualizations adjust to scale and data content making analysis at variable resolutions and of multiple data sets more informative. We demonstrate our genome browser for an extensive set of genomic data sets composed of almost 200 distinct mouse laboratory strains.
The reconstruction of ancestral genome architectures and gene orders from homologies between extant species is a long-standing problem, considered by both cytogeneticists and bioinformaticians. A comparison of the two approaches was recently investigated and discussed in a series of papers, sometimes with diverging points of view regarding the performance of these two approaches. We describe a general methodological framework for reconstructing ancestral genome segments from conserved syntenies in extant genomes. We show that this problem, from a computational point of view, is naturally related to physical mapping of chromosomes and benefits from using combinatorial tools developed in this scope. We develop this framework into a new reconstruction method considering conserved gene clusters with similar gene content, mimicking principles used in most cytogenetic studies, although on a different kind of data. We implement and apply it to datasets of mammalian genomes. We perform intensive theoretical and experimental comparisons with other bioinformatics methods for ancestral genome segments reconstruction. We show that the method that we propose is stable and reliable: it gives convergent results using several kinds of data at different levels of resolution, and all predicted ancestral regions are well supported. The results ultimately come very close to those of cytogenetic studies. This suggests that the comparison of methods for ancestral genome reconstruction should include the algorithmic aspects of the methods as well as the disciplinary differences in data acquisition.
No DNA molecule is preserved after a few hundred thousand years, so inferring the DNA sequence organization of ancient living organisms beyond several million years can only be achieved by computational estimations, using the similarities and differences between chromosomes of extant species. This is the scope of “paleogenomics”, and it can help to better understand how genomes have evolved until today. We propose here a computational framework to estimate contiguous segments of ancestral chromosomes, based on techniques of physical mapping that are used to infer chromosome maps of extant species when their genome is not sequenced. This framework is not guided by possible evolutionary events such as rearrangements but only proposes ancestral genome architectures. We developed a method following this framework and applied it to mammalian genomes. We inferred ancestral chromosomal regions that are stable and well supported at different levels of resolution. These ancestral chromosomal regions agree with previous cytogenetics studies and were very probably part of the genome of the common ancestor of humans, macaca, mice, dogs, and cows, living 120 million years ago. We illustrate, through comparison with other bioinformatics methods, the importance of a formal methodological background when comparing ancestral genome architecture proposals obtained from different methods.
Identifying syntenic regions, i.e., blocks of genes or other markers with evolutionary conserved order, and quantifying evolutionary relatedness between genomes in terms of chromosomal rearrangements is one of the central goals in comparative genomics. However, the analysis of synteny and the resulting assessment of genome rearrangements are sensitive to the choice of a number of arbitrary parameters that affect the detection of synteny blocks. In particular, the choice of a set of markers and the effect of different aggregation strategies, which enable coarse graining of synteny blocks and exclusion of micro-rearrangements, need to be assessed. Therefore, existing tools and resources that facilitate identification, visualization and analysis of synteny need to be further improved to provide a flexible platform for such analysis, especially in the context of multiple genomes.
We present a new tool, Cinteny, for fast identification and analysis of synteny with different sets of markers and various levels of coarse graining of syntenic blocks. Using the Hannenhalli-Pevzner approach and its extensions, Cinteny also enables interactive determination of evolutionary relationships between genomes in terms of the number of rearrangements (the reversal distance). In particular, Cinteny provides: i) integration of synteny browsing with assessment of evolutionary distances for multiple genomes; ii) flexibility to adjust the parameters and re-compute the results on-the-fly; iii) ability to work with user provided data, such as orthologous genes, sequence tags or other conserved markers. In addition, Cinteny provides many annotated mammalian, invertebrate and fungal genomes that are pre-loaded and available for analysis at .
Cinteny allows one to automatically compare multiple genomes and perform sensitivity analysis for synteny block detection and for the subsequent computation of reversal distances. Cinteny can also be used to interactively browse syntenic blocks conserved in multiple genomes, to facilitate genome annotation and validation of assemblies for newly sequenced genomes, and to construct and assess phylogenomic trees.
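The idea of measuring relatedness by the number of reversals can be illustrated with a short Python sketch on a single signed permutation of synteny blocks. It does not implement the Hannenhalli-Pevzner algorithm and is not Cinteny's code: it only counts breakpoints, which give a simple lower bound on the reversal distance because one reversal can remove at most two breakpoints. The function names are hypothetical.

```python
from math import ceil
from typing import List, Sequence


def count_breakpoints(perm: Sequence[int]) -> int:
    """Count breakpoints of a signed permutation relative to the identity.

    The permutation is framed with 0 and n + 1; a pair of neighbours
    (a, b) is an adjacency when b - a == 1 and a breakpoint otherwise.
    """
    n = len(perm)
    if sorted(abs(x) for x in perm) != list(range(1, n + 1)):
        raise ValueError("expected a signed permutation of 1..n")
    framed = [0, *perm, n + 1]
    return sum(1 for a, b in zip(framed, framed[1:]) if b - a != 1)


def reversal_lower_bound(perm: Sequence[int]) -> int:
    """Lower bound on the reversal distance: ceil(breakpoints / 2)."""
    return ceil(count_breakpoints(perm) / 2)


def apply_reversal(perm: Sequence[int], i: int, j: int) -> List[int]:
    """Reverse the segment perm[i..j] (inclusive) and flip its signs."""
    segment = [-x for x in reversed(perm[i:j + 1])]
    return [*perm[:i], *segment, *perm[j + 1:]]


if __name__ == "__main__":
    blocks = [1, -3, -2, 4]  # block order and orientation in a second genome
    print(count_breakpoints(blocks), reversal_lower_bound(blocks))
    print(apply_reversal(blocks, 1, 2))
```

In this toy case a single reversal of the middle segment restores the identity order, so the lower bound is tight. In general the exact reversal distance can exceed the bound, which is why an exact method such as the Hannenhalli-Pevzner approach is needed.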
Genome browsers have gained importance as more genomes and related genomic information become available. However, the increase of information brought about by new generation sequencing technologies is, at the same time, causing a subtle but continuous decrease in the efficiency of conventional genome browsers. Here, we present Genome Maps, a genome browser that implements an innovative model of data transfer and management. The program uses highly efficient technologies from the new HTML5 standard, such as scalable vector graphics, that optimize workloads at both server and client sides and ensure future scalability. Thus, data management and representation are entirely carried out by the browser, without the need of any Java Applet, Flash or other plug-in technology installation. Relevant biological data on genes, transcripts, exons, regulatory features, single-nucleotide polymorphisms, karyotype and so forth, are imported from web services and are available as tracks. In addition, several DAS servers are already included in Genome Maps. As a novelty, this web-based genome browser allows the local upload of huge genomic data files (e.g. VCF or BAM) that can be dynamically visualized in real time at the client side, thus facilitating the management of medical data affected by privacy restrictions. Finally, Genome Maps can easily be integrated in any web application by including only a few lines of code. Genome Maps is an open source collaborative initiative available in the GitHub repository (https://github.com/compbio-bigdata-viz/genome-maps). Genome Maps is available at: http://www.genomemaps.org.
Comparative genomic studies suggest that the modern day assemblage of ray-finned fishes have descended from an ancestral grouping of fishes that possessed 12–13 linkage groups. All jawed vertebrates are postulated to have experienced two whole genome duplications (WGD) in their ancestry (2R duplication). Salmonids have experienced one additional WGD (4R duplication event) compared to most extant teleosts which underwent a further 3R WGD compared to other vertebrates. We describe the organization of the 4R chromosomal segments of the proto-ray-finned fish karyotype in Atlantic salmon and rainbow trout based upon their comparative syntenies with two model species of 3R ray-finned fishes.
Evidence is presented for the retention of large whole-arm affinities between the ancestral linkage groups of the ray-finned fishes, and the 50 homeologous chromosomal segments in Atlantic salmon and rainbow trout. In the comparisons between the two salmonid species, there is also evidence for the retention of large whole-arm homeologous affinities that are associated with the retention of duplicated markers. Five of the 7 pairs of chromosomal arm regions expressing the highest level of duplicate gene expression in rainbow trout share homologous synteny to the 5 pairs of homeologs with the greatest duplicate gene expression in Atlantic salmon. These regions are derived from proto-Actinopterygian linkage groups B, C, E, J and K.
Two chromosome arms in Danio rerio and Oryzias latipes (descendants of the 3R duplication) can, in most instances, be related to at least 4 whole or partial chromosomal arms in the salmonid species. Multiple arm assignments in the two salmonid species do not clearly support a 13 proto-linkage group model, and suggest that a 12 proto-linkage group arrangement (i.e., a separate single chromosome duplication and ancestral fusion/fissions/recombination within the putative G/H/I groupings) may have occurred in the more basal soft-rayed fishes. We also found evidence supporting the model that ancestral linkage group M underwent a single chromosome duplication following the 3R duplication. In the salmonids, the M ancestral linkage groups are localized to 5 whole-arm and 3 partial-arm regions (i.e., 6 whole-arm regions expected). Thus, 3 distinct ancestral linkage groups are postulated to have existed in the G/H and M lineage chromosomes in the ancestor of the salmonids.
Genome duplications increase genetic diversity and may facilitate the evolution of gene subfunctions. Little attention, however, has focused on the evolutionary impact of lineage-specific gene loss. Here, we show that identifying lineage-specific gene loss after genome duplication is important for understanding the evolution of gene subfunctions in surviving paralogs and for improving functional connectivity among human and model organism genomes. We examine the general principles of gene loss following duplication, coupled with expression analysis of the retinaldehyde dehydrogenase Aldh1a gene family during retinoic acid signaling in eye development as a case study. Humans have three ALDH1A genes, but teleosts have just one or two. We used comparative genomics and conserved syntenies to identify loss of ohnologs (paralogs derived from genome duplication) and to clarify uncertain phylogenies. Analysis showed that Aldh1a1 and Aldh1a2 form a clade that is sister to Aldh1a3-related genes. Genome comparisons showed secondary loss of aldh1a1 in teleosts, revealing that Aldh1a1 is not a tetrapod innovation and that aldh1a3 was recently lost in medaka, making it the first known vertebrate with a single aldh1a gene. Interestingly, results revealed asymmetric distribution of surviving ohnologs between co-orthologous teleost chromosome segments, suggesting that local genome architecture can influence ohnolog survival. We propose a model that reconstructs the chromosomal history of the Aldh1a family in the ancestral vertebrate genome, coupled with the evolution of gene functions in surviving Aldh1a ohnologs after R1, R2, and R3 genome duplications.
Results provide evidence for early subfunctionalization and late subfunction-partitioning and suggest a mechanistic model based on altered regulation leading to heterochronic gene expression to explain the acquisition or modification of subfunctions by surviving ohnologs that preserve unaltered ancestral developmental programs in the face of gene loss.
Gene duplication may facilitate the acquisition of genetic diversity. Little is known, however, about the impact of gene loss on the functions of surviving genes. When a gene is lost, can other closely related genes evolve to perform the functions of the lost gene? Answering this question can be difficult because the proof for gene loss is based on negative evidence and thus can easily pass unnoticed. Here, we illustrate how the comparison of genomic neighborhoods in different species can help reconstruct the chromosomal history of a gene family and provide robust evidence for gene loss, even without an appropriate early-diverging comparator group. Identifying gene loss is important because it helps distinguish between gene gain as a lineage-specific innovation and gene loss as a lineage-specific simplification. As a case study, we investigated the expression of the Aldh1a family, which is crucial for retinoic acid signaling in development of eyes, limbs, the brain, and in cancer. Results showed that gene loss is indeed associated with the evolution of functional change in surviving gene family members. Our results highlight the relevance of comparative genomics for identifying gene loss and improving the functional connectivity among human and model organism genomes.
Autism is a highly heritable, complex neurodevelopmental disorder; identifying its genetic basis has therefore been challenging. To date, numerous susceptibility genes and chromosomal abnormalities have been reported in association with autism, but most discoveries either fail to be replicated or account for a small effect. Thus, in most cases the underlying causative genetic mechanisms are not fully understood. In the present work, the Autism Genetic Database (AGD) was developed as a literature-driven, web-based, and easy-to-access database designed with the aim of creating a comprehensive repository for all the currently reported genes and genomic copy number variations (CNVs) associated with autism in order to further facilitate the assessment of these autism susceptibility genetic factors.
AGD is a relational database that organizes data resulting from exhaustive literature searches for reported susceptibility genes and CNVs associated with autism. Furthermore, genomic information about human fragile sites and noncoding RNAs was also downloaded and parsed from miRBase, snoRNA-LBME-db, piRNABank, and the MIT/ICBP siRNA database. A web client genome browser enables viewing of the features while a web client query tool provides access to more specific information for the features. When applicable, links to external databases including GenBank, PubMed, miRBase, snoRNA-LBME-db, piRNABank, and the MIT siRNA database are provided.
AGD comprises a comprehensive list of susceptibility genes and copy number variations reported to-date in association with autism, as well as all known human noncoding RNA genes and fragile sites. Such a unique and inclusive autism genetic database will facilitate the evaluation of autism susceptibility factors in relation to known human noncoding RNAs and fragile sites, impacting on human diseases. As a result, this new autism database offers a valuable tool for the research community to evaluate genetic findings for this complex multifactorial disorder in an integrated format. AGD provides a genome browser and a web based query client for conveniently selecting features of interest. Access to AGD is freely available at .
Advances in high-throughput sequencing technology have yielded a large number of publicly available vertebrate genomes, many of which are selected for inclusion in NCBI’s RefSeq project and subsequently processed by NCBI’s eukaryotic annotation pipeline. Genome annotation results are affected by differences in available support evidence and may be impacted by annotation pipeline software changes over time. The RefSeq project has not previously assessed annotation trends across organisms or over time. To address this deficiency, we have developed a comparative protocol which integrates analysis of annotated protein-coding regions across a data set of vertebrate orthologs in genomic sequence coordinates, protein sequences, and protein features.
We assessed an ortholog dataset that includes 34 annotated vertebrate RefSeq genomes including human. We confirm that RefSeq protein-coding gene annotations in mammals exhibit considerable similarity. Over 50% of the orthologous protein-coding genes in 20 organisms are supported at the level of splicing conservation with at least three selected reference genomes. Approximately 7,500 ortholog sets include at least half of the analyzed organisms, show highly similar sequence and conserved splicing, and may serve as a minimal set of mammalian “core proteins” for initial assessment of new mammalian genomes. Additionally, 80% of the proteins analyzed pass a suite of tests to detect proteins that lack splicing conservation and have unusual sequence or domain annotation. We use these tests to define an annotation quality metric that is based directly on the annotated proteins and thus operates independently of other quality metrics such as availability of transcripts or assembly quality measures. Results are available on the RefSeq FTP site [http://ftp.ncbi.nlm.nih.gov/refseq/supplemental/ProtCore/SM1.txt].
Our multi-factored analysis demonstrates a high level of consistency in RefSeq protein representation among vertebrates. We find that the majority of the RefSeq vertebrate proteins for which we have calculated orthology are good as measured by these metrics. The process flow described provides specific information on the scope and degree of conservation for the analyzed protein sequences and annotations and will be used to enrich the quality of RefSeq records by identifying targets for further improvement in the computational annotation pipeline, and by flagging specific genes for manual curation.
Genome annotation; RefSeq proteins; Protein quality assessment; Splice orthologs
Summary: Genomes undergo large structural changes that alter their organization. The chromosomal regions affected by these rearrangements are called breakpoints, while those which have not been rearranged are called synteny blocks. Lemaitre et al. presented a new method to precisely delimit rearrangement breakpoints in a genome by comparison with the genome of a related species. Receiving as input a list of one2one orthologous genes found in the genomes of two species, the method builds a set of reliable and non-overlapping synteny blocks and refines the regions that are not contained in them. Through the alignment of each breakpoint sequence against its specific orthologous sequences in the other species, we can look for weak similarities inside the breakpoint, thus extending the synteny blocks and narrowing the breakpoints. The identification of the narrowed breakpoints relies on a segmentation algorithm and is statistically assessed. Here, we present the package Cassis that implements this method of precise detection of genomic rearrangement breakpoints.
Availability: Perl and R scripts are freely available for download at http://pbil.univ-lyon1.fr/software/Cassis/. Documentation with methodological background, technical aspects, download and setup instructions, as well as examples of applications are available together with the package. The package was tested on Linux and Mac OS environments and is distributed under the GNU GPL License.
Supplementary information: Supplementary data are available at Bioinformatics online.
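The first step described in the Summary, grouping one-to-one orthologs into synteny blocks and treating the gaps between blocks as candidate breakpoints, can be illustrated with a minimal Python sketch. This is not the Cassis implementation (which is distributed as Perl and R scripts and additionally refines breakpoints by sequence alignment and segmentation). The tuple layout, the function names `synteny_blocks` and `breakpoints`, and the `min_genes` parameter are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical input format: (chrom_a, rank_a, chrom_b, rank_b), where rank is
# the gene's index among the one-to-one orthologs on its chromosome.
Pair = Tuple[str, int, str, int]


def synteny_blocks(pairs: List[Pair], min_genes: int = 2) -> List[List[Pair]]:
    """Group ortholog pairs into maximal collinear runs (toy synteny blocks)."""
    ordered = sorted(pairs, key=lambda p: (p[0], p[1]))
    blocks: List[List[Pair]] = []
    current: List[Pair] = []
    direction = 0  # +1 or -1 once a run has two genes; 0 while undetermined
    for pair in ordered:
        if current:
            prev = current[-1]
            step = pair[3] - prev[3]
            same_chroms = pair[0] == prev[0] and pair[2] == prev[2]
            adjacent_a = pair[1] - prev[1] == 1
            consistent = step in (1, -1) and direction in (0, step)
            if same_chroms and adjacent_a and consistent:
                current.append(pair)
                direction = step
                continue
            # The run ends here; keep it only if it is long enough.
            if len(current) >= min_genes:
                blocks.append(current)
        current = [pair]
        direction = 0
    if len(current) >= min_genes:
        blocks.append(current)
    return blocks


def breakpoints(blocks: List[List[Pair]]) -> List[Tuple[str, int, int]]:
    """Regions of genome A lying between consecutive blocks on one chromosome."""
    result = []
    for left, right in zip(blocks, blocks[1:]):
        if left[-1][0] == right[0][0]:
            result.append((left[-1][0], left[-1][1], right[0][1]))
    return result


# Made-up example: six genes on chr1 of genome A; the last three map to a
# different chromosome of genome B in reversed order.
example = [
    ("chr1", 0, "chrX", 10), ("chr1", 1, "chrX", 11), ("chr1", 2, "chrX", 12),
    ("chr1", 3, "chrY", 5), ("chr1", 4, "chrY", 4), ("chr1", 5, "chrY", 3),
]
example_blocks = synteny_blocks(example)
print(len(example_blocks))          # 2
print(breakpoints(example_blocks))  # [('chr1', 2, 3)]
```

In this toy setting a breakpoint is reported only as the pair of flanking gene ranks; narrowing it to sequence coordinates is the part that the published method addresses.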
The "small phylogeny" problem consists in inferring ancestral genomes associated with each internal node of a phylogenetic tree of a set of extant species. Existing methods can be grouped into two main categories: the distance-based methods aiming at minimizing a total branch length, and the synteny-based (or mapping) methods that first predict a collection of relations between ancestral markers in terms of "synteny", and then assemble this collection into a set of Contiguous Ancestral Regions (CARs). The predicted CARs are likely to be more reliable as they are more directly deduced from observed conservations in extant species. However, the challenge is to end up with a completely assembled genome.
We develop a new synteny-based method that is flexible enough to handle a model of evolution involving whole genome duplication events, in addition to rearrangements, gene insertions, and losses. Ancestral relationships between markers are defined in terms of Gapped Adjacencies, i.e. pairs of markers separated by up to a given number of markers. This improves on a previous approach restricted to direct adjacencies, which showed high accuracy for adjacency prediction, but with the drawback of being overly conservative, i.e. of generating a large number of CARs. Applying our algorithm to various simulated data sets reveals good performance, as we usually end up with a completely assembled genome while keeping a low error rate.
All source code is available at http://www.iro.umontreal.ca/~mabrouk.
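The definition of a Gapped Adjacency given above (a pair of markers separated by up to a given number of markers) can be made concrete with a small Python sketch. It illustrates the definition only, not the authors' reconstruction algorithm. The function name `gapped_adjacencies`, the `max_gap` parameter, the unsigned (orientation-free) treatment of markers and the assumption that each marker occurs once per genome are all simplifications introduced here.

```python
from typing import FrozenSet, List, Set


def gapped_adjacencies(genome: List[List[str]], max_gap: int) -> Set[FrozenSet[str]]:
    """Return unordered marker pairs separated by at most max_gap markers.

    genome is a list of chromosomes, each a list of unique marker names.
    max_gap = 0 yields the direct adjacencies only.
    """
    pairs: Set[FrozenSet[str]] = set()
    for chrom in genome:
        for i, left in enumerate(chrom):
            # The next max_gap + 1 markers are within reach of `left`.
            for right in chrom[i + 1 : i + 2 + max_gap]:
                pairs.add(frozenset((left, right)))
    return pairs


# Made-up example: two extant genomes that differ by swapping markers b and c.
g1 = [["a", "b", "c", "d"]]
g2 = [["a", "c", "b", "d"]]

shared_direct = gapped_adjacencies(g1, 0) & gapped_adjacencies(g2, 0)
shared_gapped = gapped_adjacencies(g1, 1) & gapped_adjacencies(g2, 1)
print(len(shared_direct))  # 1
print(len(shared_gapped))  # 5
```

With direct adjacencies the two toy genomes share a single pair, whereas allowing one intervening marker recovers five shared pairs. This is consistent with the abstract's point that a direct-adjacency criterion is conservative and tends to leave many separate CARs.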
The Rat Genome Database (RGD) (http://rgd.mcw.edu/) is the premier site for comprehensive data on the different strains of the laboratory rat (Rattus norvegicus). The strain data are collected from various publications, direct submissions from individual researchers, and rat providers worldwide. Rat strain, substrain designation and nomenclature follow the Guidelines for Nomenclature of Mouse and Rat Strains, instituted by the International Committee on Standardized Genetic Nomenclature for Mice. While symbols and names aid in identifying strains correctly, the flat nature of this information prohibits easy search and retrieval, as well as other data mining functions. In order to improve these functionalities, particularly in ontology-based tools, the Rat Strain Ontology (RS) was developed.
The Rat Strain Ontology (RS) reflects the breeding history, parental background, and genetic manipulation of rat strains. This controlled vocabulary organizes strains by type: inbred, outbred, chromosome altered, congenic, mutant and so on. In addition, under the chromosome altered category, strains are organized by chromosome, and further by type of manipulations, such as mutant or congenic. This allows users to easily retrieve strains of interest with modifications in specific genomic regions. The ontology was developed using the Open Biological and Biomedical Ontology (OBO) file format, and is organized on the Directed Acyclic Graph (DAG) structure. Rat Strain Ontology IDs are included as part of the strain report (RS: ######).
As rat researchers are often unaware of the number of substrains or altered strains within a breeding line, this vocabulary now provides an easy way to retrieve all substrains and accompanying information. Its usefulness is particularly evident in tools such as the PhenoMiner at RGD, where users can now easily retrieve phenotype measurement data for related strains, strains with similar backgrounds or those with similar introgressed regions. This controlled vocabulary also allows better retrieval and filtering for QTLs and in genomic tools such as the GViewer.
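Retrieving all substrains beneath a term in a Directed Acyclic Graph amounts to a graph traversal over is_a relationships. The Python sketch below shows the idea under stated assumptions: the term names are invented placeholders rather than real RS identifiers, the edge list is hand-written rather than parsed from the OBO file, and `build_children` and `descendants` are hypothetical helper names.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Set, Tuple


def build_children(is_a_edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Index (child, parent) is_a edges by parent for top-down traversal."""
    children: Dict[str, Set[str]] = defaultdict(set)
    for child, parent in is_a_edges:
        children[parent].add(child)
    return children


def descendants(children: Dict[str, Set[str]], term: str) -> Set[str]:
    """Collect every term reachable below `term` (breadth-first search)."""
    seen: Set[str] = set()
    queue = deque([term])
    while queue:
        node = queue.popleft()
        for child in children.get(node, ()):
            if child not in seen:  # a DAG node may be reached via several parents
                seen.add(child)
                queue.append(child)
    return seen


# Placeholder terms (not real RS IDs). "congenic_A_chr1" has two parents,
# which is what distinguishes a DAG from a simple tree.
edges = [
    ("inbred", "rat_strain"),
    ("chromosome_altered", "rat_strain"),
    ("strain_A", "inbred"),
    ("strain_A_sub1", "strain_A"),
    ("strain_A_sub2", "strain_A"),
    ("chromosome_1", "chromosome_altered"),
    ("congenic_A_chr1", "chromosome_1"),
    ("congenic_A_chr1", "strain_A"),
]
children = build_children(edges)
print(sorted(descendants(children, "strain_A")))
# ['congenic_A_chr1', 'strain_A_sub1', 'strain_A_sub2']
```

Because a term can have more than one parent, the same strain is found whether a query starts from its breeding line or from the altered chromosome.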
The Rat Strain Ontology has been incorporated into the RGD Ontology Browser (http://rgd.mcw.edu/rgdweb/ontology/view.html?acc_id=RS:0000457#s) and is available through the National Center for Biomedical Ontology (http://bioportal.bioontology.org/ontologies/1150) or the RGD ftp site (
Rat strains; Phylogeny; RGD; Rat genome database
SAR11 is an ancient and diverse clade of heterotrophic bacteria that are abundant throughout the world’s oceans, where they play a major role in the ocean carbon cycle. Correlations between the phylogenetic branching order and spatiotemporal patterns in cell distributions from planktonic ocean environments indicate that SAR11 has evolved into perhaps a dozen or more specialized ecotypes that span evolutionary distances equivalent to a bacterial order. We isolated and sequenced genomes from diverse SAR11 cultures that represent three major lineages and encompass the full breadth of the clade. The new data expand observations about genome evolution and gene content that previously had been restricted to the SAR11 Ia subclade, providing a much broader perspective on the clade’s origins, evolution, and ecology. We found small genomes throughout the clade and a very high proportion of core genome genes (48 to 56%), indicating that small genome size is probably an ancestral characteristic. In their level of core genome conservation, the members of SAR11 are outliers, the most conserved free-living bacteria known. Shared features of the clade include low GC content, high gene synteny, a large hypervariable region bounded by rRNA genes, and low numbers of paralogs. Variation among the genomes included genes for phosphorus metabolism, glycolysis, and C1 metabolism, suggesting that adaptive specialization in nutrient resource utilization is important to niche partitioning and ecotype divergence within the clade. These data provide support for the conclusion that streamlining selection for efficient cell replication in the planktonic habitat has occurred throughout the evolution and diversification of this clade.
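The core-genome proportion mentioned above can be illustrated with a toy calculation. The Python sketch below assumes that the core genome is the set of gene families present in every genome compared, and that the proportion is taken per genome as core families divided by that genome's total families. These definitions, the function name `core_fraction` and the example data are assumptions for illustration and are not taken from the study's methods.

```python
from typing import Dict, Set


def core_fraction(genomes: Dict[str, Set[str]]) -> Dict[str, float]:
    """Fraction of each genome's gene families that belong to the shared core."""
    core = set.intersection(*genomes.values())
    return {name: len(core) / len(families) for name, families in genomes.items()}


# Made-up gene-family content for three hypothetical isolates.
toy_genomes = {
    "isolate_1": {"f1", "f2", "f3", "f4"},
    "isolate_2": {"f1", "f2", "f5", "f6", "f7"},
    "isolate_3": {"f1", "f2", "f3", "f8"},
}
fractions = core_fraction(toy_genomes)
print(fractions["isolate_1"])  # 0.5
```

Under this definition, small genomes with few accessory families yield high core fractions, and adding more divergent genomes to the comparison can only shrink the core.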
The SAR11 clade is the most abundant group of marine microorganisms worldwide, making them key players in the global carbon cycle. Growing knowledge about their biochemistry and metabolism is leading to a more mechanistic understanding of organic carbon oxidation and sequestration in the oceans. The discovery of small genomes in SAR11 provided crucial support for the theory that streamlining selection can drive genome reduction in low-nutrient environments. Study of isolates in culture revealed atypical organic nutrient requirements that can be attributed to genome reduction, such as conditional auxotrophy for glycine and its precursors, a requirement for reduced sulfur compounds, and evidence for widespread cycling of C1 compounds in marine environments. However, understanding the genetic variation and distribution of such pathways and characteristics like streamlining throughout the group has required the isolation and genome sequencing of diverse SAR11 representatives, an analysis of which we provide here.