Comparative genomics combined with phylogenetic reconstructions are powerful approaches to study the evolution of genes and genomes. However, the current rapid expansion of the volume of genomic information makes it increasingly difficult to interrogate, integrate and synthesize comparative genome data while taking into account the maximum breadth of information available. GenomicusPlants (http://www.genomicus.biologie.ens.fr/genomicus-plants) is an extension of the Genomicus webserver that addresses this issue by allowing users to explore flowering plant genomes in an intuitive way, across the broadest evolutionary scales. Extant genomes of 26 flowering plants can be analyzed, as well as 23 ancestral reconstructed genomes. Ancestral gene order provides a long-term chronological view of gene order evolution, greatly facilitating comparative genomics and evolutionary studies. Four main interfaces (‘views’) are available where: (i) PhyloView combines phylogenetic trees with comparisons of genomic loci across any number of genomes; (ii) AlignView projects loci of interest against all other genomes to visualize its topological conservation; (iii) MatrixView compares two genomes in a classical dotplot representation; and (iv) Karyoview visualizes chromosome karyotypes ‘painted’ with colours of another genome of interest. All four views are interconnected and benefit from many customizable features.
Ancestral reconstruction; Evolution; Flowering plants; Genomics; Synteny
Genomicus (http://www.dyogen.ens.fr/genomicus/) is a database and an online tool that allows easy comparative genomic visualization in >150 eukaryote genomes. It provides a way to explore spatial information related to gene organization within and between genomes and temporal relationships related to gene and genome evolution. For the specific vertebrate phylum, it also provides access to ancestral gene order reconstructions and conserved non-coding elements information. We extended the Genomicus database originally dedicated to vertebrate to four new clades, including plants, non-vertebrate metazoa, protists and fungi. This visualization tool allows evolutionary phylogenomics analysis and exploration. Here, we describe the graphical modules of Genomicus and show how it is capable of revealing differential gene loss and gain, segmental or genome duplications and study the evolution of a locus through homology relationships.
The analysis of genome synteny is a common practice in comparative genomics. With the advent of DNA sequencing technologies, individual biologists can rapidly produce their genomic sequences of interest. Although web-based synteny visualization tools are convenient for biologists to use, none of the existing ones allow biologists to upload their own data for analysis.
We have developed the web-based Genome Synteny Viewer (GSV) that allows users to upload two data files for synteny visualization, the mandatory synteny file for specifying genomic positions of conserved regions and the optional genome annotation file. GSV presents two selected genomes in a single integrated view while still retaining the browsing flexibility necessary for exploring individual genomes. Users can browse and filter for genomic regions of interest, change the color or shape of each annotation track as well as re-order, hide or show the tracks dynamically. Additional features include downloadable images, immediate email notification and tracking of usage history. The entire GSV package is also light-weighted which enables easy local installation.
GSV provides a unique option for biologists to analyze genome synteny by uploading their own data set to a web-based comparative genome browser. A web server hosting GSV is provided at http://cas-bioinfo.cas.unt.edu/gsv, and the software is also freely available for local installations.
The recent availability of an expanding collection of genome sequences driven by technological advances has facilitated comparative genomics and in particular the identification of synteny among multiple genomes. However, the development of effective and easy-to-use methods for identifying such conserved gene clusters among multiple genomes–synteny blocks–as well as databases, which host synteny blocks from various groups of species (especially eukaryotes) and also allow users to run synteny-identification programs, lags behind.
OrthoClusterDB is a new online platform for the identification and visualization of synteny blocks. OrthoClusterDB consists of two key web pages: Run OrthoCluster and View Synteny. The Run OrthoCluster page serves as web front-end to OrthoCluster, a recently developed program for synteny block detection. Run OrthoCluster offers full control over the functionalities of OrthoCluster, such as specifying synteny block size, considering order and strandedness of genes within synteny blocks, including or excluding nested synteny blocks, handling one-to-many orthologous relationships, and comparing multiple genomes. In contrast, the View Synteny page gives access to perfect and imperfect synteny blocks precomputed for a large number of genomes, without the need for users to retrieve and format input data. Additionally, genes are cross-linked with public databases for effective browsing. For both Run OrthoCluster and View Synteny, identified synteny blocks can be browsed at the whole genome, chromosome, and individual gene level. OrthoClusterDB is freely accessible.
We have developed an online system for the identification and visualization of synteny blocks among multiple genomes. The system is freely available at .
The UCSC Cancer Genomics Browser (https://genome-cancer.ucsc.edu) comprises a suite of web-based tools to integrate, visualize and analyze cancer genomics and clinical data. The browser displays whole-genome views of genome-wide experimental measurements for multiple samples alongside their associated clinical information. Multiple data sets can be viewed simultaneously as coordinated ‘heatmap tracks’ to compare across studies or different data modalities. Users can order, filter, aggregate, classify and display data interactively based on any given feature set including clinical features, annotated biological pathways and user-contributed collections of genes. Integrated standard statistical tools provide dynamic quantitative analysis within all available data sets. The browser hosts a growing body of publicly available cancer genomics data from a variety of cancer types, including data generated from the Cancer Genome Atlas project. Multiple consortiums use the browser on confidential prepublication data enabled by private installations. Many new features have been added, including the hgMicroscope tumor image viewer, hgSignature for real-time genomic signature evaluation on any browser track, and ‘PARADIGM’ pathway tracks to display integrative pathway activities. The browser is integrated with the UCSC Genome Browser; thus inheriting and integrating the Genome Browser’s rich set of human biology and genetics data that enhances the interpretability of the cancer genomics data.
Whole-genome comparisons are highly informative regarding genome evolution and can reveal the conservation of genome organization and gene content, gene regulatory elements, and presence of species-specific genes. Initial comparative genome analyses of the human malaria parasite Plasmodium falciparum and rodent malaria parasites (RMPs) revealed a core set of 4,500 Plasmodium orthologs located in the highly syntenic central regions of the chromosomes that sharply defined the boundaries of the variable subtelomeric regions. We used composite RMP contigs, based on partial DNA sequences of three RMPs, to generate a whole-genome synteny map of P. falciparum and the RMPs. The core regions of the 14 chromosomes of P. falciparum and the RMPs are organized in 36 synteny blocks, representing groups of genes that have been stably inherited since these malaria species diverged, but whose relative organization has altered as a result of a predicted minimum of 15 recombination events. P. falciparum-specific genes and gene families are found in the variable subtelomeric regions (575 genes), at synteny breakpoints (42 genes), and as intrasyntenic indels (126 genes). Of the 168 non-subtelomeric P. falciparum genes, including two newly discovered gene families, 68% are predicted to be exported to the surface of the blood stage parasite or infected erythrocyte. Chromosomal rearrangements are implicated in the generation and dispersal of P. falciparum-specific gene families, including one encoding receptor-associated protein kinases. The data show that both synteny breakpoints and intrasyntenic indels can be foci for species-specific genes with a predicted role in host-parasite interactions and suggest that, besides rearrangements in the subtelomeric regions, chromosomal rearrangements may also be involved in the generation of species-specific gene families. A majority of these genes are expressed in blood stages, suggesting that the vertebrate host exerts a greater selective pressure than the mosquito vector, resulting in the acquisition of diversity.
Malaria, caused by the parasite Plasmodium falciparum, is one of the most devastating infectious diseases. Rodent malaria parasites (RMPs), such as P. berghei, P. chabaudi, and P. yoelii, are used as models for P. falciparum. For the use of these models in studies of human disease, insight into both the similarities and differences in the genomics and biology of these parasites is important. The availability of significant but partial genome data of the RMPs enabled the construction of a virtual composite RMP genome and its comparison with the P. falciparum genome, generating a so-called synteny map. Analysis of this map provided the desired comparative insights. A high level of conservation exists between roughly 85% of the genes at the level of content and order, but 168 P. falciparum-specific genes that disrupted the conserved genome segments were identified. The majority of these genes were predicted to play a role in host–parasite interactions. This study indicates that determination of the synteny breakpoints may help to rapidly identify the species-specific gene content of future Plasmodium genomes, providing the malaria research community with a powerful investigative tool. The findings may also be of interest to those studying chromosomal evolution.
Genome Browsers are software that allow the user to view genome annotations in the context of a reference sequence, such as a chromosome, contig, scaffold, etc. The Generic Genome Browser (GBrowse) is an open source genome browser package developed as part of the Generic Model Database Project (see Unit 9.9; Stein et at., 2002). The increasing number of sequenced genomes has to a corresponding growth in the field of comparative genomics, which requires methods to view and compare multiple genomes. Using the same software framework as GBrowse, the Generic Synteny Browser (GBrowse_syn) allows the comparison of co-linear regions of multiple genomes using the familiar GBrowse-style web page. Like GBrowse, GBrowse_syn can be configured to display any organism and is currently the synteny browser used for model organisms such as C. elegans (WormBase; www.wormbase.org; see Unit 1.8) and Arabidopsis (TAIR; www.arabidopsis.org; see Unit 1.11). GBrowse_syn is part of the GBrowse software package and can be downloaded from the web and run on any unix-like operating system, such as Linux, Solaris, Mac OS X etc. GBrowse_syn is still under active development. This unit will cover installation and configuration as part of the current stable version of GBrowse (v1.71).
Summary: Genomes undergo large structural changes that alter their organization. The chromosomal regions affected by these rearrangements are called breakpoints, while those which have not been rearranged are called synteny blocks. Lemaitre et al. presented a new method to precisely delimit rearrangement breakpoints in a genome by comparison with the genome of a related species. Receiving as input a list of one2one orthologous genes found in the genomes of two species, the method builds a set of reliable and non-overlapping synteny blocks and refines the regions that are not contained into them. Through the alignment of each breakpoint sequence against its specific orthologous sequences in the other species, we can look for weak similarities inside the breakpoint, thus extending the synteny blocks and narrowing the breakpoints. The identification of the narrowed breakpoints relies on a segmentation algorithm and is statistically assessed. Here, we present the package Cassis that implements this method of precise detection of genomic rearrangement breakpoints.
Availability: Perl and R scripts are freely available for download at http://pbil.univ-lyon1.fr/software/Cassis/. Documentation with methodological background, technical aspects, download and setup instructions, as well as examples of applications are available together with the package. The package was tested on Linux and Mac OS environments and is distributed under the GNU GPL License.
Supplementary information: Supplementary data are available at Bioinformatics online.
Due to the lack of availability of large genomic sequences for peach or other Prunus species, the degree of synteny conservation between the Prunus species and Arabidopsis has not been systematically assessed. Using the recently available peach EST sequences that are anchored to Prunus genetic maps and to peach physical map, we analyzed the extent of conserved synteny between the Prunus and the Arabidopsis genomes. The reconstructed pseudo-ancestral Arabidopsis genome, existed prior to the proposed recent polyploidy event, was also utilized in our analysis to further elucidate the evolutionary relationship.
We analyzed the synteny conservation between the Prunus and the Arabidopsis genomes by comparing 475 peach ESTs that are anchored to Prunus genetic maps and their Arabidopsis homologs detected by sequence similarity. Microsyntenic regions were detected between all five Arabidopsis chromosomes and seven of the eight linkage groups of the Prunus reference map. An additional 1097 peach ESTs that are anchored to 431 BAC contigs of the peach physical map and their Arabidopsis homologs were also analyzed. Microsyntenic regions were detected in 77 BAC contigs. The syntenic regions from both data sets were short and contained only a couple of conserved gene pairs. The synteny between peach and Arabidopsis was fragmentary; all the Prunus linkage groups containing syntenic regions matched to more than two different Arabidopsis chromosomes, and most BAC contigs with multiple conserved syntenic regions corresponded to multiple Arabidopsis chromosomes. Using the same peach EST datasets and their Arabidopsis homologs, we also detected conserved syntenic regions in the pseudo-ancestral Arabidopsis genome. In many cases, the gene order and content of peach regions was more conserved in the ancestral genome than in the present Arabidopsis region. Statistical significance of each syntenic group was calculated using simulated Arabidopsis genome.
We report here the result of the first extensive analysis of the conserved microsynteny using DNA sequences across the Prunus genome and their Arabidopsis homologs. Our study also illustrates that both the ancestral and present Arabidopsis genomes can provide a useful resource for marker saturation and candidate gene search, as well as elucidating evolutionary relationships between species.
Fragaria vesca, a diploid strawberry species commonly known as the alpine or woodland strawberry, is a versatile experimental plant system and an emerging model for the Rosaceae family. An ancestral F. vesca genome contributed to the genome of the octoploid dessert strawberry (F. ×ananassa), and the extant genome exhibits synteny with other commercially important members of the Rosaceae family such as apple and peach. To provide a molecular description of floral organ and fruit development at the resolution of specific tissues and cell types, RNAs from flowers and early developmental stage fruit tissues of the inbred F. vesca line YW5AF7 were extracted and the resulting cDNA libraries sequenced using an Illumina HiSeq2000. To enable easy access as well as mining of this two-dimensional (stage and tissue) transcriptome dataset, a web-based database, the Strawberry Genomic Resource (SGR), was developed.
SGR is a web accessible database that contains sample description, sample statistics, gene annotation, and gene expression analysis. This information can be accessed publicly from a web-based interface at http://bioinformatics.towson.edu/strawberry/Default.aspx. The SGR website provides user friendly search and browse capabilities for all the data stored in the database. Users are able to search for genes using a gene ID or description or obtain differentially expressed genes by entering different comparison parameters. Search results can be downloaded in a tabular format compatible with Microsoft excel application. Aligned reads to individual genes and exon/intron structures are displayed using the genome browser, facilitating gene re-annotation by individual users.
The SGR database was developed to facilitate dissemination and data mining of extensive floral and fruit transcriptome data in the woodland strawberry. It enables users to mine the data in different ways to study different pathways or biological processes during reproductive development.
Strawberry; Transcriptome; RNA-seq; Database; gBrowse; Fruit; Flowers; Rosaceae
Extant genomes share regions where genes have the same order and orientation, which are thought to arise from the conservation of an ancestral order of genes during evolution. Such regions of so-called conserved synteny, or synteny blocks, must be precisely identified and quantified, as a prerequisite to better understand the evolutionary history of genomes.
Here we describe PhylDiag, a software that identifies statistically significant synteny blocks in pairwise comparisons of eukaryote genomes. Compared to previous methods, PhylDiag uses gene trees to define gene homologies, thus allowing gene deletions to be considered as events that may break the synteny. PhylDiag also accounts for gene orientations, blocks of tandem duplicates and lineage specific de novo gene births. Starting from two genomes and the corresponding gene trees, PhylDiag returns synteny blocks with gaps less than or equal to the maximum gap parameter gapmax. This parameter is theoretically estimated, and together with a utility to graphically display results, contributes to making PhylDiag a user friendly method. In addition, putative synteny blocks are subject to a statistical validation to verify that they are unlikely to be due to a random combination of genes.
We benchmark several known metrics to measure 2D-distances in a matrix of homologies and we compare PhylDiag to i-ADHoRe 3.0 on real and simulated data. We show that PhylDiag correctly identifies small synteny blocks even with insertions, deletions, incorrect annotations or micro-inversions. Finally, PhylDiag allowed us to identify the most relevant distance metric for 2D-distance calculation between homologies.
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2105-15-268) contains supplementary material, which is available to authorized users.
Comparative genomics; Synteny; Synteny block; Segmental homologies; Homology; Gene order; Rearrangement; Ancestral genome; Gene tree
Functional genomics has particular promise in silkworm biology for identifying genes involved in a variety of biological functions that include: synthesis and secretion of silk, sex determination pathways, insect-pathogen interactions, chorionogenesis, molecular clocks. Wild silkmoths have hardly been the subject of detailed scientific investigations, owing largely to non-availability of molecular and genetic data on these species. As a first step, in the present study we generated large scale expressed sequence tags (EST) in three economically important species of wild silkmoths. In order to make these resources available for the use of global scientific community, an EST database called 'WildSilkbase' was developed.
WildSilkbase is a catalogue of ESTs generated from several tissues at different developmental stages of 3 economically important saturniid silkmoths, an Indian golden silkmoth, Antheraea assama, an Indian tropical tasar silkmoth, A. mylitta and eri silkmoth, Samia cynthia ricini. Currently the database is provided with 57,113 ESTs which are clustered and assembled into 4,019 contigs and 10,019 singletons. Data can be browsed and downloaded using a standard web browser. Users can search the database either by BLAST query, keywords or Gene Ontology query. There are options to carry out searches for species, tissue and developmental stage specific ESTs in BLAST page. Other features of the WildSilkbase include cSNP discovery, GO viewer, homologue finder, SSR finder and links to all other related databases. The WildSilkbase is freely available from .
A total of 14,038 putative unigenes was identified in 3 species of wild silkmoths. These genes provide important resources to gain insight into the functional and evolutionary study of wild silkmoths. We believe that WildSilkbase will be extremely useful for all those researchers working in the areas of comparative genomics, functional genomics and molecular evolution in general, and gene discovery, gene organization, transposable elements and genome variability of insect species in particular.
Autism is a highly heritable complex neurodevelopmental disorder, therefore identifying its genetic basis has been challenging. To date, numerous susceptibility genes and chromosomal abnormalities have been reported in association with autism, but most discoveries either fail to be replicated or account for a small effect. Thus, in most cases the underlying causative genetic mechanisms are not fully understood. In the present work, the Autism Genetic Database (AGD) was developed as a literature-driven, web-based, and easy to access database designed with the aim of creating a comprehensive repository for all the currently reported genes and genomic copy number variations (CNVs) associated with autism in order to further facilitate the assessment of these autism susceptibility genetic factors.
AGD is a relational database that organizes data resulting from exhaustive literature searches for reported susceptibility genes and CNVs associated with autism. Furthermore, genomic information about human fragile sites and noncoding RNAs was also downloaded and parsed from miRBase, snoRNA-LBME-db, piRNABank, and the MIT/ICBP siRNA database. A web client genome browser enables viewing of the features while a web client query tool provides access to more specific information for the features. When applicable, links to external databases including GenBank, PubMed, miRBase, snoRNA-LBME-db, piRNABank, and the MIT siRNA database are provided.
AGD comprises a comprehensive list of susceptibility genes and copy number variations reported to-date in association with autism, as well as all known human noncoding RNA genes and fragile sites. Such a unique and inclusive autism genetic database will facilitate the evaluation of autism susceptibility factors in relation to known human noncoding RNAs and fragile sites, impacting on human diseases. As a result, this new autism database offers a valuable tool for the research community to evaluate genetic findings for this complex multifactorial disorder in an integrated format. AGD provides a genome browser and a web based query client for conveniently selecting features of interest. Access to AGD is freely available at .
It has been repeatedly observed that gene order is rapidly lost in prokaryotic genomes. However, persistent synteny blocks are found when comparing more or less distant species. These genes that remain consistently adjacent are appealing candidates for the study of genome evolution and a more accurate definition of their functional role. Such studies require visualizing conserved synteny blocks in a large number of genomes at all taxonomic distances.
After comparing nearly 600 completely sequenced genomes encompassing the whole prokaryotic tree of life, the computed synteny data were assembled in a relational database, SynteBase. SynteView was designed to visualize conserved synteny blocks in a large number of genomes after choosing one of them as a reference. SynteView functions with data stored either in SynteBase or in a home-made relational database of personal data. In addition, this software can compute on-the-fly and display the distribution of synteny blocks which are conserved in pairs of genomes. This tool has been designed to provide a wealth of information on each positional orthologous gene, to be user-friendly and customizable. It is also possible to download sequences of genes belonging to these synteny blocks for further studies. SynteView is accessible through Java Webstart at .
SynteBase answers queries about gene order conservation and SynteView visualizes the obtained results in a flexible and powerful way which provides a comparative overview of the conserved synteny in a large number of genomes, whatever their taxonomic distances.
Genome browsers have gained importance as more genomes and related genomic information become available. However, the increase of information brought about by new generation sequencing technologies is, at the same time, causing a subtle but continuous decrease in the efficiency of conventional genome browsers. Here, we present Genome Maps, a genome browser that implements an innovative model of data transfer and management. The program uses highly efficient technologies from the new HTML5 standard, such as scalable vector graphics, that optimize workloads at both server and client sides and ensure future scalability. Thus, data management and representation are entirely carried out by the browser, without the need of any Java Applet, Flash or other plug-in technology installation. Relevant biological data on genes, transcripts, exons, regulatory features, single-nucleotide polymorphisms, karyotype and so forth, are imported from web services and are available as tracks. In addition, several DAS servers are already included in Genome Maps. As a novelty, this web-based genome browser allows the local upload of huge genomic data files (e.g. VCF or BAM) that can be dynamically visualized in real time at the client side, thus facilitating the management of medical data affected by privacy restrictions. Finally, Genome Maps can easily be integrated in any web application by including only a few lines of code. Genome Maps is an open source collaborative initiative available in the GitHub repository (https://github.com/compbio-bigdata-viz/genome-maps). Genome Maps is available at: http://www.genomemaps.org.
A common approach to understanding the genetic basis of complex traits is through identification of associated quantitative trait loci (QTL). Fine mapping QTLs requires several generations of backcrosses and analysis of large populations, which is time-consuming and costly effort. Furthermore, as entire genomes are being sequenced and an increasing amount of genetic and expression data are being generated, a challenge remains: linking phenotypic variation to the underlying genomic variation. To identify candidate genes and understand the molecular basis underlying the phenotypic variation of traits, bioinformatic approaches are needed to exploit information such as genetic map, expression and whole genome sequence data of organisms in biological databases.
The Sol Genomics Network (SGN, http://solgenomics.net) is a primary repository for phenotypic, genetic, genomic, expression and metabolic data for the Solanaceae family and other related Asterids species and houses a variety of bioinformatics tools. SGN has implemented a new approach to QTL data organization, storage, analysis, and cross-links with other relevant data in internal and external databases. The new QTL module, solQTL, http://solgenomics.net/qtl/, employs a user-friendly web interface for uploading raw phenotype and genotype data to the database, R/QTL mapping software for on-the-fly QTL analysis and algorithms for online visualization and cross-referencing of QTLs to relevant datasets and tools such as the SGN Comparative Map Viewer and Genome Browser. Here, we describe the development of the solQTL module and demonstrate its application.
solQTL allows Solanaceae researchers to upload raw genotype and phenotype data to SGN, perform QTL analysis and dynamically cross-link to relevant genetic, expression and genome annotations. Exploration and synthesis of the relevant data is expected to help facilitate identification of candidate genes underlying phenotypic variation and markers more closely linked to QTLs. solQTL is freely available on SGN and can be used in private or public mode.
The reconstruction of ancestral genome architectures and gene orders from homologies between extant species is a long-standing problem, considered by both cytogeneticists and bioinformaticians. A comparison of the two approaches was recently investigated and discussed in a series of papers, sometimes with diverging points of view regarding the performance of these two approaches. We describe a general methodological framework for reconstructing ancestral genome segments from conserved syntenies in extant genomes. We show that this problem, from a computational point of view, is naturally related to physical mapping of chromosomes and benefits from using combinatorial tools developed in this scope. We develop this framework into a new reconstruction method considering conserved gene clusters with similar gene content, mimicking principles used in most cytogenetic studies, although on a different kind of data. We implement and apply it to datasets of mammalian genomes. We perform intensive theoretical and experimental comparisons with other bioinformatics methods for ancestral genome segments reconstruction. We show that the method that we propose is stable and reliable: it gives convergent results using several kinds of data at different levels of resolution, and all predicted ancestral regions are well supported. The results come eventually very close to cytogenetics studies. It suggests that the comparison of methods for ancestral genome reconstruction should include the algorithmic aspects of the methods as well as the disciplinary differences in data aquisition.
No DNA molecule is preserved after a few hundred thousand years, so inferring the DNA sequence organization of ancient living organisms beyond several million years can only be achieved by computational estimations, using the similarities and differences between chromosomes of extant species. This is the scope of “paleogenomics”, and it can help to better understand how genomes have evolved until today. We propose here a computational framework to estimate contiguous segments of ancestral chromosomes, based on techniques of physical mapping that are used to infer chromosome maps of extant species when their genome is not sequenced. This framework is not guided by possible evolutionary events such as rearrangements but only proposes ancestral genome architectures. We developed a method following this framework and applied it to mammalian genomes. We inferred ancestral chromosomal regions that are stable and well supported at different levels of resolution. These ancestral chromosomal regions agree with previous cytogenetics studies and were very probably part of the genome of the common ancestor of humans, macaca, mice, dogs, and cows, living 120 million years ago. We illustrate, through comparison with other bioinformatics methods, the importance of a formal methodological background when comparing ancestral genome architecture proposals obtained from different methods.
Comparative genomic studies suggest that the modern day assemblage of ray-finned fishes have descended from an ancestral grouping of fishes that possessed 12–13 linkage groups. All jawed vertebrates are postulated to have experienced two whole genome duplications (WGD) in their ancestry (2R duplication). Salmonids have experienced one additional WGD (4R duplication event) compared to most extant teleosts which underwent a further 3R WGD compared to other vertebrates. We describe the organization of the 4R chromosomal segments of the proto-ray-finned fish karyotype in Atlantic salmon and rainbow trout based upon their comparative syntenies with two model species of 3R ray-finned fishes.
Evidence is presented for the retention of large whole-arm affinities between the ancestral linkage groups of the ray-finned fishes, and the 50 homeologous chromosomal segments in Atlantic salmon and rainbow trout. In the comparisons between the two salmonid species, there is also evidence for the retention of large whole-arm homeologous affinities that are associated with the retention of duplicated markers. Five of the 7 pairs of chromosomal arm regions expressing the highest level of duplicate gene expression in rainbow trout share homologous synteny to the 5 pairs of homeologs with the greatest duplicate gene expression in Atlantic salmon. These regions are derived from proto-Actinopterygian linkage groups B, C, E, J and K.
Two chromosome arms in Danio rerio and Oryzias latipes (descendants of the 3R duplication) can, in most instances be related to at least 4 whole or partial chromosomal arms in the salmonid species. Multiple arm assignments in the two salmonid species do not clearly support a 13 proto-linkage group model, and suggest that a 12 proto-linkage group arrangement (i.e., a separate single chromosome duplication and ancestral fusion/fissions/recombination within the putative G/H/I groupings) may have occurred in the more basal soft-rayed fishes. We also found evidence supporting the model that ancestral linkage group M underwent a single chromosome duplication following the 3R duplication. In the salmonids, the M ancestral linkage groups are localized to 5 whole arm, and 3 partial arm regions (i.e., 6 whole arm regions expected). Thus, 3 distinct ancestral linkage groups are postulated to have existed in the G/H and M lineage chromosomes in the ancestor of the salmonids.
Genome duplications increase genetic diversity and may facilitate the evolution of gene subfunctions. Little attention, however, has focused on the evolutionary impact of lineage-specific gene loss. Here, we show that identifying lineage-specific gene loss after genome duplication is important for understanding the evolution of gene subfunctions in surviving paralogs and for improving functional connectivity among human and model organism genomes. We examine the general principles of gene loss following duplication, coupled with expression analysis of the retinaldehyde dehydrogenase Aldh1a gene family during retinoic acid signaling in eye development as a case study. Humans have three ALDH1A genes, but teleosts have just one or two. We used comparative genomics and conserved syntenies to identify loss of ohnologs (paralogs derived from genome duplication) and to clarify uncertain phylogenies. Analysis showed that Aldh1a1 and Aldh1a2 form a clade that is sister to Aldh1a3-related genes. Genome comparisons showed secondarily loss of aldh1a1 in teleosts, revealing that Aldh1a1 is not a tetrapod innovation and that aldh1a3 was recently lost in medaka, making it the first known vertebrate with a single aldh1a gene. Interestingly, results revealed asymmetric distribution of surviving ohnologs between co-orthologous teleost chromosome segments, suggesting that local genome architecture can influence ohnolog survival. We propose a model that reconstructs the chromosomal history of the Aldh1a family in the ancestral vertebrate genome, coupled with the evolution of gene functions in surviving Aldh1a ohnologs after R1, R2, and R3 genome duplications. Results provide evidence for early subfunctionalization and late subfunction-partitioning and suggest a mechanistic model based on altered regulation leading to heterochronic gene expression to explain the acquisition or modification of subfunctions by surviving ohnologs that preserve unaltered ancestral developmental programs in the face of gene loss.
Gene duplication may facilitate the acquisition of genetic diversity. Little is known, however, about the impact of gene loss on the functions of surviving genes. When a gene is lost, can other closely related genes evolve to perform the functions of the lost gene? Answering this question can be difficult because the proof for gene loss is based on negative evidence and thus can easily pass unnoticed. Here, we illustrate how the comparison of genomic neighborhoods in different species can help reconstruct the chromosomal history of a gene family and provide robust evidence for gene loss, even without an appropriate early-diverging comparator group. Identifying gene loss is important because it helps distinguish between gene gain as a lineage-specific innovation and gene loss as a lineage-specific simplification. As a case study, we investigated the expression of the Aldh1a family, which is crucial for retinoic acid signaling in development of eyes, limbs, the brain, and in cancer. Results showed that gene loss is indeed associated with the evolution of functional change in surviving gene family members. Our results highlight the relevance of comparative genomics for identifying gene loss and improving the functional connectivity among human and model organism genomes.
Long-range chromosomal associations between genomic regions, and their repositioning in the 3D space of the nucleus, are now considered to be key contributors to the regulation of gene expression and important links have been highlighted with other genomic features involved in DNA rearrangements. Recent Chromosome Conformation Capture (3C) measurements performed with high throughput sequencing (Hi-C) and molecular dynamics studies show that there is a large correlation between colocalization and coregulation of genes, but these important researches are hampered by the lack of biologists-friendly analysis and visualisation software. Here, we describe NuChart, an R package that allows the user to annotate and statistically analyse a list of input genes with information relying on Hi-C data, integrating knowledge about genomic features that are involved in the chromosome spatial organization. NuChart works directly with sequenced reads to identify the related Hi-C fragments, with the aim of creating gene-centric neighbourhood graphs on which multi-omics features can be mapped. Predictions about CTCF binding sites, isochores and cryptic Recombination Signal Sequences are provided directly with the package for mapping, although other annotation data in bed format can be used (such as methylation profiles and histone patterns). Gene expression data can be automatically retrieved and processed from the Gene Expression Omnibus and ArrayExpress repositories to highlight the expression profile of genes in the identified neighbourhood. Moreover, statistical inferences about the graph structure and correlations between its topology and multi-omics features can be performed using Exponential-family Random Graph Models. The Hi-C fragment visualisation provided by NuChart allows the comparisons of cells in different conditions, thus providing the possibility of novel biomarkers identification. NuChart is compliant with the Bioconductor standard and it is freely available at ftp://fileserver.itb.cnr.it/nuchart.
Physical maps are important tools to uncover general chromosome structure as well as to compare different plant lineages and species, helping to elucidate genome structure, evolution and possibilities regarding synteny and colinearity. The increasing production of sequence data has opened an opportunity to link information from mapping studies to the underlying sequences. Genome browsers are invaluable platforms that provide access to these sequences, including tools for genome analysis, allowing the integration of multivariate information, and thus aiding to explain the emergence of complex genomes. The present work presents a tutorial regarding the use of genome browsers to develop targeted physical mapping, providing also a general overview and examples about the possibilities regarding the use of Fluorescent In Situ Hybridization (FISH) using bacterial artificial chromosomes (BAC), simple sequence repeats (SSR) and rDNA probes, highlighting the potential of such studies for map integration and comparative genetics. As a case study, the available genome of soybean was accessed to show how the physical and in silico distribution of such sequences may be compared at different levels. Such evaluations may also be complemented by the identification of sequences beyond the detection level of cytological methods, here using members of the aquaporin gene family as an example. The proposed approach highlights the complementation power of the combination of molecular cytogenetics and computational approaches for the anchoring of coding or repetitive sequences in plant genomes using available genome browsers, helping in the determination of sequence location, arrangement and number of repeats, and also filling gaps found in computational pseudochromosome assemblies.
gene families; FISH; BAC; SSR; aquaporin; bioinformatics
Summary: Track data hubs provide an efficient mechanism for visualizing remotely hosted Internet-accessible collections of genome annotations. Hub datasets can be organized, configured and fully integrated into the University of California Santa Cruz (UCSC) Genome Browser and accessed through the familiar browser interface. For the first time, individuals can use the complete browser feature set to view custom datasets without the overhead of setting up and maintaining a mirror.
Availability and implementation: Source code for the BigWig, BigBed and Genome Browser software is freely available for non-commercial use at http://hgdownload.cse.ucsc.edu/admin/jksrc.zip, implemented in C and supported on Linux. Binaries for the BigWig and BigBed creation and parsing utilities may be downloaded at http://hgdownload.cse.ucsc.edu/admin/exe/. Binary Alignment/Map (BAM) and Variant Call Format (VCF)/tabix utilities are available from http://samtools.sourceforge.net/ and http://vcftools.sourceforge.net/. The UCSC Genome Browser is publicly accessible at http://genome.ucsc.edu.
Motivation: KEGG PATHWAY is a service of Kyoto Encyclopedia of Genes and Genomes (KEGG), constructing manually curated pathway maps that represent current knowledge on biological networks in graph models. While valuable graph tools have been implemented in R/Bioconductor, to our knowledge there is currently no software package to parse and analyze KEGG pathways with graph theory.
Results: We introduce the software package KEGGgraph in R and Bioconductor, an interface between KEGG pathways and graph models as well as a collection of tools for these graphs. Superior to existing approaches, KEGGgraph captures the pathway topology and allows further analysis or dissection of pathway graphs. We demonstrate the use of the package by the case study of analyzing human pancreatic cancer pathway.
Availability:KEGGgraph is freely available at the Bioconductor web site (http://www.bioconductor.org). KGML files can be downloaded from KEGG FTP site (ftp://ftp.genome.jp/pub/kegg/xml).
Supplementary information: Supplementary data are available at Bioinformatics online.
Modern plant genomes are diploidized paleopolyploids. We revisited grass genome paleohistory in response to the diploidization process through a detailed investigation of the evolutionary fate of duplicated blocks. Ancestrally duplicated genes can be conserved, deleted, and shuffled, defining dominant (bias toward duplicate retention) and sensitive (bias toward duplicate erosion) chromosomal fragments. We propose a new grass genome paleohistory deriving from an ancestral karyotype structured in seven protochromosomes containing 16,464 protogenes and following evolutionary rules where 1) ancestral shared polyploidizations shaped conserved dominant (D) and sensitive (S) subgenomes, 2) subgenome dominance is revealed by both gene deletion and shuffling from the S blocks, 3) duplicate deletion/movement may have been mediated by single-/double-stranded illegitimate recombination mechanisms, 4) modern genomes arose through centromeric fusion of protochromosomes, leading to functional monocentric neochromosomes, 5) the fusion of two dominant blocks leads to supradominant neochromosomes (D + D = D) with higher ancestral gene retention compared with D + S = D (i.e., fusion of blocks with opposite sensitivity) or even S + S = S (i.e., fusion of two sensitive ancestral blocks). A new user-friendly online tool named “PlantSyntenyViewer,” available at http://urgi.versailles.inra.fr/synteny-cereal, presents the refined comparative genomics data.
synteny; evolution; genome; dominance; duplication; ancestor
The conservation of gene order among prokaryotic genomes can provide valuable insight into gene function, protein interactions, or events by which genomes have evolved. Although some tools are available for visualizing and comparing the order of genes between genomes of study, few support an efficient and organized analysis between large numbers of genomes. The Prokaryotic Sequence homology Analysis Tool (PSAT) is a web tool for comparing gene neighborhoods among multiple prokaryotic genomes.
PSAT utilizes a database that is preloaded with gene annotation, BLAST hit results, and gene-clustering scores designed to help identify regions of conserved gene order. Researchers use the PSAT web interface to find a gene of interest in a reference genome and efficiently retrieve the sequence homologs found in other bacterial genomes. The tool generates a graphic of the genomic neighborhood surrounding the selected gene and the corresponding regions for its homologs in each comparison genome. Homologs in each region are color coded to assist users with analyzing gene order among various genomes. In contrast to common comparative analysis methods that filter sequence homolog data based on alignment score cutoffs, PSAT leverages gene context information for homologs, including those with weak alignment scores, enabling a more sensitive analysis. Features for constraining or ordering results are designed to help researchers browse results from large numbers of comparison genomes in an organized manner. PSAT has been demonstrated to be useful for helping to identify gene orthologs and potential functional gene clusters, and detecting genome modifications that may result in loss of function.
PSAT allows researchers to investigate the order of genes within local genomic neighborhoods of multiple genomes. A PSAT web server for public use is available for performing analyses on a growing set of reference genomes through any web browser with no client side software setup or installation required. Source code is freely available to researchers interested in setting up a local version of PSAT for analysis of genomes not available through the public server. Access to the public web server and instructions for obtaining source code can be found at .