Some non-pathogenic trypanosomatids maintain a mutualistic relationship with a betaproteobacterium of the Alcaligenaceae family. Intensive nutritional exchanges have been reported between the two partners, indicating that these protozoa are excellent biological models to study metabolic co-evolution. We previously sequenced and herein investigate the entire genomes of five trypanosomatids which harbor a symbiotic bacterium (SHTs for Symbiont-Harboring Trypanosomatids) and the respective bacteria (TPEs for Trypanosomatid Proteobacterial Endosymbiont), as well as two trypanosomatids without symbionts (RTs for Regular Trypanosomatids), for the presence of genes of the classical pathways for vitamin biosynthesis. Our data show that genes for the biosynthetic pathways of thiamine, biotin, and nicotinic acid are absent from all trypanosomatid genomes. This is in agreement with the absolute growth requirement for these vitamins in all protozoa of the family. Also absent from the genomes of RTs are the genes for the synthesis of pantothenic acid, folic acid, riboflavin, and vitamin B6. This is also in agreement with the available data showing that RTs are auxotrophic for these essential vitamins. On the other hand, SHTs are autotrophic for such vitamins. Indeed, all the genes of the corresponding biosynthetic pathways were identified, most of them in the symbiont genomes, while a few genes, mostly of eukaryotic origin, were found in the host genomes. The only exceptions to the latter are: the gene coding for the enzyme ketopantoate reductase (EC:1.1.1.169), which is related instead to the Firmicutes bacteria; and two other genes, one involved in the salvage pathway of pantothenic acid and the other in the synthesis of ubiquinone, that are related to Gammaproteobacteria. Their presence in trypanosomatids may result from lateral gene transfer. Taken together, our results reinforce the idea that the low nutritional requirement of SHTs is associated with the presence of the symbiotic bacterium, which contains most genes for vitamin production.
The type IV secretion system (T4SS) can be classified as a large family of macromolecule transporter systems, divided into three recognized sub-families, according to the well-known functions. The major sub-family is the conjugation system, which allows transfer of genetic material, such as a nucleoprotein, via cell contact among bacteria. Also, the conjugation system can transfer genetic material from bacteria to eukaryotic cells; such is the case with the T-DNA transfer of Agrobacterium tumefaciens to host plant cells. The system of effector protein transport constitutes the second sub-family, and the third one corresponds to the DNA uptake/release system. Genome analyses have revealed numerous T4SS in Bacteria and Archaea. The purpose of this work was to organize, classify, and integrate the T4SS data into a single database, called AtlasT4SS - the first public database devoted exclusively to this prokaryotic secretion system.
The AtlasT4SS is a manually curated database that describes a large number of proteins related to the type IV secretion system reported so far in Gram-negative and Gram-positive bacteria, as well as in Archaea. The database was created using the MySQL RDBMS and the Catalyst framework, which is based on the Perl programming language and follows the Model-View-Controller (MVC) design pattern for Web applications. The current version holds a comprehensive collection of 1,617 T4SS proteins from 58 Bacteria (49 Gram-negative and 9 Gram-positive), one archaeon, and 11 plasmids. By applying the bi-directional best hit (BBH) relationship in pairwise genome comparisons, it was possible to obtain a core set of 134 clusters of orthologous genes encoding T4SS proteins.
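The bi-directional best hit relationship mentioned above can be illustrated with a minimal sketch. It assumes that pairwise similarity scores (for example BLAST bit scores) are already available for each direction of the comparison; the function names, gene identifiers, and scores are hypothetical and are not taken from the AtlasT4SS pipeline.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    `scores` maps (query, subject) pairs to a similarity score;
    higher is better. On ties, the first pair encountered is kept.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical scores between proteins of genome A and genome B.
a_vs_b = {
    ("virB4_A", "virB4_B"): 950.0,
    ("virB4_A", "traC_B"): 210.0,
    ("virB11_A", "virB11_B"): 610.0,
    ("orfX_A", "virB4_B"): 120.0,
}
b_vs_a = {
    ("virB4_B", "virB4_A"): 948.0,
    ("virB11_B", "virB11_A"): 605.0,
    ("traC_B", "virB4_A"): 208.0,
}

pairs = bidirectional_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

Here `orfX_A` is dropped because its best hit, `virB4_B`, points back to a different gene. Grouping such pairs across many genome comparisons is one common way to build clusters of putative orthologs.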
In our database we present one way of classifying orthologous groups of T4SSs in a hierarchical classification scheme with three levels. The first level comprises four classes that are based on the organization of genetic determinants, shared homologies, and evolutionary relationships: (i) F-T4SS, (ii) P-T4SS, (iii) I-T4SS, and (iv) GI-T4SS. The second level designates either a specific well-known protein family or an uncharacterized protein family. Finally, in the third level, each protein of an ortholog cluster is classified according to its involvement in a specific cellular process. The AtlasT4SS database is open access and is available at http://www.t4ss.lncc.br.
The invasive land snail Achatina fulica is one of the most damaging agricultural pests worldwide, representing a potentially serious threat to natural ecosystems and human health. This species is known to carry parasites and harbors a dense and metabolically active microbial community; however, little is known about its diversity and composition. Here, we assessed for the first time the complexity of bacterial communities occurring in the digestive tracts of field-collected snails (FC) by using culture-independent molecular analysis. Crop and intestinal bacteria in FC were then compared to those from groups of snails that were reared in the laboratory (RL) on a sugarcane-based diet. Most of the sequences recovered were novel and related to those reported for the guts of herbivores. Changes in the relative abundance of Bacteroidetes and Firmicutes were observed when the snails were fed a high-sugar diet, suggesting that the snail gut microbiota can influence the energy balance equation. Furthermore, this study represents a first step in gaining a better understanding of land snail gut microbiota and shows that this is a complex holobiont system containing diverse, abundant and active microbial communities.
Species belonging to the Rhizobiales are intriguing and extensively researched because the order includes both bacteria able to fix nitrogen when in symbiosis with leguminous plants and bacteria pathogenic to animals and plants. Similarities between the strategies adopted by pathogenic and symbiotic Rhizobiales have been described, as well as high variability related to events of horizontal gene transfer. Although it is well known that chromosomal rearrangements, mutations, and horizontal gene transfer influence the dynamics of bacterial genomes, in Rhizobiales the scenarios that determine a pathogenic or symbiotic lifestyle are not clear, and very few comparative genomics studies of these classes of prokaryotic microorganisms have tried to delineate the evolution of symbiosis and pathogenesis.
Non-symbiotic nitrogen-fixing bacteria and bacteria involved in bioremediation that are closely related to the symbionts and pathogens under study may help clarify the origin and ancestry of genes and the gene flow occurring in Rhizobiales. Genomic comparisons of 19 species of Rhizobiales, including nitrogen-fixing bacteria, bioremediators, and pathogens, resulted in 33 clusters common to biological nitrogen fixation and pathogenesis, 15 clusters exclusive to all nitrogen-fixing bacteria and bacteria involved in bioremediation, 13 clusters found in only some nitrogen-fixing and bioremediation bacteria, one cluster exclusive to some symbionts, and one cluster found only in some of the pathogens analyzed. In the bidirectional best hit (BBH) analysis performed on all strains studied, 77 common genes were obtained, 17 of which were related to biological nitrogen fixation and pathogenesis. Phylogenetic reconstructions for Fix, Nif, Nod, Vir, and Trb showed possible horizontal gene transfer events, grouping species of different phenotypes.
The presence of symbiotic and virulence genes in both pathogens and symbionts does not seem to be the only determinant factor for lifestyle evolution in these microorganisms, although they may act in common stages of host infection. The phylogenetic analysis for many distinct operons involved in these processes emphasizes the relevance of horizontal gene transfer events in the symbiotic and pathogenic similarity.
DAS is a widely adopted protocol for providing syntactic interoperability among biological databases. The popularity of DAS is due to a simplified and elegant mechanism for data exchange that consists of sources exposing their RESTful interfaces for data access. As a growing number of DAS services are available for molecular biology resources, there is an incentive to explore this protocol in order to advance data discovery and integration among these resources.
We developed DASMiner, a Matlab toolkit for querying DAS data sources that enables creation of integrated biological models using the information available in DAS-compliant repositories. DASMiner is composed of a browser application and an API that work together to facilitate gathering of data from different DAS sources, which can be used for creating enriched datasets from multiple sources.
The browser is used to formulate queries and navigate data contained in DAS sources. Users can execute queries against these sources in an intuitive fashion, without the need of knowing the specific DAS syntax for the particular source. Using the source's metadata provided by the DAS Registry, the browser's layout adapts to expose only the set of commands and coordinate systems supported by the specific source. For this reason, the browser can interrogate any DAS source, independently of the type of data being served.
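To make the "specific DAS syntax" concrete, the sketch below builds the kind of request URL that a DAS client issues for the `features` command. It assumes the request pattern of the DAS 1.5/1.6 specification (`/das/<source>/features?segment=<ref>:<start>,<stop>`, with further arguments separated by semicolons). The server address and source name are hypothetical, and the code is written in Python for illustration rather than with the DASMiner Matlab API.

```python
from urllib.parse import quote


def das_features_url(server, source, reference, start, stop, feature_type=None):
    """Build a DAS 'features' request URL for a single segment.

    Coordinates are 1-based and inclusive. The optional `feature_type`
    adds a 'type' filter to the request.
    """
    if start < 1 or stop < start:
        raise ValueError("expected 1 <= start <= stop")
    url = f"{server.rstrip('/')}/das/{quote(source)}/features"
    url += f"?segment={quote(str(reference))}:{start},{stop}"
    if feature_type is not None:
        url += f";type={quote(feature_type)}"
    return url


# Hypothetical server and data source, used only to show the URL shape.
url = das_features_url("http://das.example.org/", "my_source", "1", 1000, 2000)
print(url)
```

A browser or API layer such as the one described above would generate requests like this from the user's selections and then parse the XML returned by the source.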
The API component of DASMiner may be used for programmatic access of DAS sources by programs in Matlab. Once the desired data is found during navigation, the query is exported in the format of an API call to be used within any Matlab application. We illustrate the use of DASMiner by creating integrative models of histone modification maps and protein-protein interaction networks. These enriched datasets were built by retrieving and integrating distributed genomic and proteomic DAS sources using the API.
Support for the DAS protocol allows hundreds of molecular biology databases to be treated as a federated, online collection of resources. DASMiner enables full exploration of these resources, and can be used to deploy applications and create integrated views of biological systems using the information deposited in DAS repositories.
Vibrio taxonomy has been based on a polyphasic approach. In this study, we retrieve useful taxonomic information (i.e. data that can be used to distinguish different taxonomic levels, such as species and genera) from 32 genome sequences of different vibrio species. We use a variety of tools to explore the taxonomic relationship between the sequenced genomes, including Multilocus Sequence Analysis (MLSA), supertrees, Average Amino Acid Identity (AAI), genomic signatures, and Genome BLAST atlases. Our aim is to analyse the usefulness of these tools for species identification in vibrios.
We have generated four new genome sequences of three Vibrio species, i.e., V. alginolyticus 40B, V. harveyi-like 1DA3, and V. mimicus strains VM573 and VM603, and present a broad analysis of these genomes along with other sequenced Vibrio species. The genome atlas and pangenome plots provide a tantalizing image of the genomic differences that occur between closely related sister species, e.g. V. cholerae and V. mimicus. The vibrio pangenome contains around 26504 genes. The V. cholerae core genome and pangenome consist of 1520 and 6923 genes, respectively. Pangenomes might allow different strains of V. cholerae to occupy different niches. MLSA and supertree analyses resulted in a similar phylogenetic picture, with a clear distinction of four groups (Vibrio core group, V. cholerae-V. mimicus, Aliivibrio spp., and Photobacterium spp.). A Vibrio species is defined as a group of strains that share > 95% DNA identity in MLSA and supertree analysis, > 96% AAI, ≤ 10 genome signature dissimilarity, and > 61% proteome identity. Strains of the same species and species of the same genus will form monophyletic groups on the basis of MLSA and supertree.
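The species definition above amounts to four thresholds applied together, which can be written as a small check. This is a minimal sketch: the class name, field names, and example values are hypothetical, and the code assumes the four metrics have already been computed for a pair of genomes by the respective tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenomePairMetrics:
    mlsa_identity: float            # % DNA identity in MLSA/supertree genes
    aai: float                      # % average amino acid identity
    signature_dissimilarity: float  # genome signature dissimilarity
    proteome_identity: float        # % proteome identity


def same_vibrio_species(m):
    """Apply the four thresholds quoted in the text; all must hold."""
    return (
        m.mlsa_identity > 95.0
        and m.aai > 96.0
        and m.signature_dissimilarity <= 10.0
        and m.proteome_identity > 61.0
    )


# Hypothetical metric values for two genome pairs.
close_pair = GenomePairMetrics(97.2, 97.5, 6.0, 70.0)
distant_pair = GenomePairMetrics(92.0, 90.0, 14.0, 55.0)

print(same_vibrio_species(close_pair), same_vibrio_species(distant_pair))
```

Requiring all four criteria at once is the strict reading of the definition; a pair that passes only some of them would need closer inspection rather than an automatic call.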
The combination of different analytical and bioinformatics tools will enable the most accurate species identification through genomic computational analysis. This endeavour will culminate in the birth of the online genomic taxonomy whereby researchers and end-users of taxonomy will be able to identify their isolates through a web-based server. This novel approach to microbial systematics will result in a tremendous advance concerning biodiversity discovery, description, and understanding.
The Structural Descriptor Database (SDDB) is a web-based tool that predicts the function of proteins and functional site positions based on the structural properties of related protein families. Structural alignments and functional residues of a known protein set (defined as the training set) are used to build special Hidden Markov Models (HMM) called HMM descriptors. SDDB uses previously calculated and stored HMM descriptors for predicting active sites, binding residues, and protein function. The database integrates biologically relevant data filtered from several databases such as PDB, PDBSUM, CSA and SCOP. It accepts queries in fasta format and predicts functional residue positions, protein-ligand interactions, and protein function, based on the SCOP database.
To assess the SDDB performance, we used different data sets. The Trypsin-like Serine protease data set assessed how well SDDB predicts functional sites when curated data is available. The SCOP family data set was used to analyze SDDB performance by using training data extracted from PDBSUM (binding sites) and from CSA (active sites). The ATP-binding experiment was used to compare our approach with the most current method. For all evaluations, significant improvements were obtained with SDDB.
SDDB performed better when trustworthy training data was available. SDDB worked better in predicting active sites than binding sites because the former are more conserved than the latter. Nevertheless, by using our prediction method we obtained results with precision above 70%.
Little is known about bacterial transcriptional regulatory networks (TRNs). In Escherichia coli, which is the organism with the largest wet-lab validated TRN, its set of interactions involves only ~50% of the repertoire of transcription factors currently known, and ~25% of its genes. Of those, only a small proportion describes the regulation of processes that are clinically relevant, such as drug resistance mechanisms.
We designed feed-forward (FF) and bi-fan (BF) motif predictors for E. coli using multi-layer perceptron artificial neural networks (ANNs). The motif predictors were trained using a large dataset of gene expression data; the collection of motifs was extracted from the E. coli TRN. Each network motif was mapped to a vector of correlations which were computed using the gene expression profile of the elements in the motif. Thus, by combining network structural information with transcriptome data, FF and BF predictors were able to classify with a high precision of 83% and 96%, respectively, and with a high recall of 86% and 97%, respectively. These results were found when motifs were represented using different types of correlations together, i.e., Pearson, Spearman, Kendall, and partial correlation. We then applied the best predictors to hypothesize new regulations for 16 operons involved with multidrug resistance (MDR) efflux pumps, which are considered as a major bacterial mechanism to fight antimicrobial agents. As a result, the motif predictors assigned new transcription factors for these MDR proteins, turning them into high-quality candidates to be experimentally tested.
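The mapping from a network motif to a vector of correlations can be sketched as follows. This is an illustrative sketch only: it covers Pearson and Spearman correlations and leaves out the Kendall and partial correlations also used in the study. The gene names and expression profiles are hypothetical, and the resulting vector is the kind of input a classifier such as a multi-layer perceptron could be trained on.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation; assumes neither profile is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation, computed as Pearson on the ranks."""
    return pearson(ranks(x), ranks(y))


def motif_feature_vector(profiles, genes):
    """Pearson and Spearman correlation for every gene pair in the motif."""
    features = []
    for a, b in combinations(genes, 2):
        features.append(pearson(profiles[a], profiles[b]))
        features.append(spearman(profiles[a], profiles[b]))
    return features


# Hypothetical expression profiles across five conditions for a
# feed-forward motif: tf_x regulates tf_y, and both regulate target_z.
profiles = {
    "tf_x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "tf_y": [2.0, 4.0, 6.0, 8.0, 10.0],
    "target_z": [1.0, 4.0, 9.0, 16.0, 25.0],
}
vector = motif_feature_vector(profiles, ["tf_x", "tf_y", "target_z"])
print(len(vector))
```

A three-gene motif yields three gene pairs and therefore six features in this sketch; a bi-fan motif with four genes would yield six pairs and twelve features.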
The motif predictors presented herein can be used to identify novel regulatory interactions by using microarray data. The presentation of an example motif to predictors will make them categorize whether or not the example motif is a BF, or whether or not it is an FF. This approach is useful to find new "pieces" of the TRN, when inspecting the regulation of a small set of operons. Furthermore, it shows that correlations of expression data can be used to discriminate between elements that are arranged in structural motifs and those in random sets of transcripts.
The inverse problem of identifying the topology of biological networks from their time series responses is a cornerstone challenge in systems biology. We tackle this challenge here through the parameterization of S-system models. It was previously shown that parameter identification can be performed as an optimization based on the decoupling of the differential S-system equations, which results in a set of algebraic equations.
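For orientation, the S-system form and the decoupling step can be written out as below. The notation is the conventional one from Biochemical Systems Theory and is an assumption here, since the text does not define its symbols: $X_i$ are the $n$ metabolite concentrations, $\alpha_i, \beta_i$ are rate constants, $g_{ij}, h_{ij}$ are kinetic orders, and $S_i(t_k)$ is the slope of $X_i$ estimated from the time series at sample time $t_k$.

```latex
% S-system: one production and one degradation power-law term per metabolite
\frac{dX_i}{dt} = \alpha_i \prod_{j=1}^{n} X_j^{g_{ij}}
                - \beta_i \prod_{j=1}^{n} X_j^{h_{ij}},
\qquad i = 1, \dots, n

% Decoupling: replace each derivative by the estimated slope at every
% sample time, giving n independent sets of algebraic equations
S_i(t_k) \approx \alpha_i \prod_{j=1}^{n} X_j(t_k)^{g_{ij}}
               - \beta_i \prod_{j=1}^{n} X_j(t_k)^{h_{ij}},
\qquad k = 1, \dots, N
```

Because the equations for each $i$ involve only the parameters of that metabolite, they can be fitted one metabolite at a time, which is what makes the parameter identification modular.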
A novel parameterization solution is proposed for the identification of S-system models from time series when no information about the network topology is known. The method is based on eigenvector optimization of a matrix formed from multiple regression equations of the linearized decoupled S-system. Furthermore, the algorithm is extended to the optimization of network topologies with constraints on metabolites and fluxes. These constraints rejoin the system in cases where it had been fragmented by decoupling. We demonstrate with synthetic time series why the algorithm can be expected to converge in most cases.
A procedure was developed that facilitates automated reverse engineering tasks for biological networks using S-systems. The proposed method of eigenvector optimization constitutes an advancement over S-system parameter identification from time series using a recent method called Alternating Regression. The proposed method overcomes convergence issues encountered in alternating regression by identifying nonlinear constraints that restrict the search space to computationally feasible solutions. Because the parameter identification is still performed for each metabolite separately, the modularity and linear time characteristics of the alternating regression method are preserved. Simulation studies illustrate how the proposed algorithm identifies the correct network topology out of a collection of models which all fit the dynamical time series essentially equally well.
In the past years, several studies have begun to unravel the structure, dynamical properties, and evolution of transcriptional regulatory networks. However, even those comparative studies that focus on a group of closely related organisms are limited by the rather scarce knowledge of regulatory interactions outside a few model organisms, such as E. coli among the prokaryotes.
In this paper we used the information annotated in Tractor_DB (a database of regulatory networks in gamma-proteobacteria) to calculate a normalized Site Orthology Score (SOS) that quantifies the conservation of a regulatory link across thirty genomes of this subclass. Then we used this SOS to assess how regulatory connections have evolved in this group, and how the variation of basic regulatory connection is reflected on the structure of the chromosome. We found that individual regulatory interactions shift between different organisms, a process that may be described as rewiring the network. At this evolutionary scale (the gamma-proteobacteria subclass) this rewiring process may be an important source of variation of regulatory incoming interactions for individual networks. We also noticed that the regulatory links that form feed forward motifs are conserved in a better correlated manner than triads of random regulatory interactions or pairs of co-regulated genes. Furthermore, the rewiring process that takes place at the most basic level of the regulatory network may be linked to rearrangements of genetic material within bacterial chromosomes, which change the structure of Transcription Units and therefore the regulatory connections between Transcription Factors and structural genes.
The rearrangements that occur in bacterial chromosomes, mostly inversion or horizontal gene transfer events, are important sources of variation of gene regulation at this evolutionary scale.
Structure identification of dynamic models for complex biological systems is the cornerstone of their reverse engineering. Biochemical Systems Theory (BST) offers a particularly convenient solution because its parameters are kinetic-order coefficients which directly identify the topology of the underlying network of processes. We have previously proposed a numerical decoupling procedure that allows the identification of multivariate dynamic models of complex biological processes. While described here within the context of BST, this procedure has a general applicability to signal extraction. Our original implementation relied on artificial neural networks (ANN), which caused slight, undesirable bias during the smoothing of the time courses. As an alternative, we propose here an adaptation of Whittaker's smoother and demonstrate its role within a robust, fully automated structure identification procedure.
In this report we propose a robust, fully automated solution for signal extraction from time series, which is the prerequisite for the efficient reverse engineering of biological systems models. Whittaker's smoother is reformulated within the context of information theory and extended by the development of adaptive signal segmentation to account for heterogeneous noise structures. The resulting procedure can be used on arbitrary time series with a nonstationary noise process; it is illustrated here with metabolic profiles obtained from in-vivo NMR experiments. The smoothed solution, which is free of parametric bias, permits differentiation, which is crucial for the numerical decoupling of systems of differential equations.
The method is applicable to signal extraction from time series with nonstationary noise structure and can be applied in the numerical decoupling of systems of differential equations into algebraic equations, and thus constitutes a rather general tool for the reverse engineering of mechanistic model descriptions from multivariate experimental time series.
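As a point of reference, the classical Whittaker smoother on which the adaptation builds can be sketched in a few lines. It finds the series z that minimizes |y - z|^2 + lam * |D z|^2, where D is a finite-difference operator, by solving (I + lam * D^T D) z = y. This sketch is not the information-theoretic reformulation or the adaptive segmentation described above; the function names, the fixed penalty `lam`, and the example data are assumptions for illustration.

```python
def difference_matrix(n, order):
    """Rows of the `order`-th finite-difference operator on n points."""
    rows = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(order):
        # Differencing consecutive rows raises the difference order by one.
        rows = [[b - a for a, b in zip(r0, r1)] for r0, r1 in zip(rows, rows[1:])]
    return rows


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def whittaker_smooth(y, lam, order=2):
    """Classical Whittaker smoother with a fixed penalty weight `lam`."""
    n = len(y)
    d = difference_matrix(n, order)
    a = [
        [(1.0 if i == j else 0.0) + lam * sum(row[i] * row[j] for row in d)
         for j in range(n)]
        for i in range(n)
    ]
    return solve(a, [float(v) for v in y])


# Hypothetical noisy, roughly linear time course.
noisy = [0.0, 1.2, 1.9, 3.3, 3.8, 5.1, 6.2, 6.8]
smooth = whittaker_smooth(noisy, lam=10.0)
print([round(v, 3) for v in smooth])
```

Larger values of `lam` give a smoother curve. The dense solve keeps the sketch short, whereas a practical implementation would exploit the banded structure of the system. Because the result is a smooth series on the original time grid, slopes can then be estimated from it by finite differences, which is the step that numerical decoupling relies on.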
Version 2.0 of Tractor_DB is now accessible at its three international mirrors. This database contains a collection of computationally predicted Transcription Factors' binding sites in gamma-proteobacterial genomes. These data should aid researchers in the design of microarray experiments and the interpretation of their results. They should also facilitate studies of Comparative Genomics of the regulatory networks of this group of organisms. In this paper we describe the main improvements incorporated into the database in the past year and a half, which include incorporating information on the regulatory networks of 13 new gamma-proteobacteria (increasing the total to 30) and developing a new computational strategy to complement the putative sites identified by the original weight matrix-based approach. We have also added dynamically generated navigation tabs to the navigation interfaces. Moreover, we developed a new interface that allows users to directly retrieve information on the conservation of regulatory interactions in the 30 genomes included in the database by navigating a map that represents a core of the known Escherichia coli regulatory network.
This work reports the results of analyses of three complete mycoplasma genomes, a pathogenic (7448) and a nonpathogenic (J) strain of the swine pathogen Mycoplasma hyopneumoniae and a strain of the avian pathogen Mycoplasma synoviae; the genome sizes of the three strains were 920,079 bp, 897,405 bp, and 799,476 bp, respectively. These genomes were compared with other sequenced mycoplasma genomes reported in the literature to examine several aspects of mycoplasma evolution. Strain-specific regions, including integrative and conjugal elements, and genome rearrangements and alterations in adhesin sequences were observed in the M. hyopneumoniae strains, and all of these were potentially related to pathogenicity. Genomic comparisons revealed that reduction in genome size implied loss of redundant metabolic pathways, with maintenance of alternative routes in different species. Horizontal gene transfer was consistently observed between M. synoviae and Mycoplasma gallisepticum. Our analyses indicated a likely transfer event of hemagglutinin-coding DNA sequences from M. gallisepticum to M. synoviae.
Trypanosomatids of the genera Angomonas and Strigomonas live in a mutualistic association characterized by extensive metabolic cooperation with obligate endosymbiotic Betaproteobacteria. However, the role played by the symbiont has been inferred by indirect means more than demonstrated directly. Symbiont-harboring trypanosomatids, in contrast to their counterparts lacking symbionts, exhibit lower nutritional requirements and are autotrophic for essential amino acids. To demonstrate the symbiont’s contributions to this autotrophy, entire genomes of symbionts and trypanosomatids with and without symbionts were sequenced here.
Analyses of the essential amino acid pathways revealed that most biosynthetic routes are in the symbiont genome. By contrast, the host trypanosomatid genome contains fewer genes, about half of which originated from different bacterial groups, perhaps only one of which (ornithine cyclodeaminase, EC:4.3.1.12) derived from the symbiont. Nutritional, enzymatic, and genomic data were jointly analyzed to construct an integrated view of essential amino acid metabolism in symbiont-harboring trypanosomatids. This comprehensive analysis showed perfect concordance among all these data, and revealed that the symbiont contains genes for enzymes that complete essential biosynthetic routes for the host amino acid production, thus explaining the low requirement for these elements in symbiont-harboring trypanosomatids. Phylogenetic analyses show that the cooperation between symbionts and their hosts is complemented by multiple horizontal gene transfers, from bacterial lineages to trypanosomatids, that occurred several times in the course of their evolution. Transfers occur preferentially in parts of the pathways that are missing from other eukaryotes.
We have herein uncovered the genetic and evolutionary bases of essential amino acid biosynthesis in several trypanosomatids with and without endosymbionts, explaining and complementing decades of experimental results. We uncovered the remarkable plasticity in essential amino acid biosynthesis pathway evolution in these protozoans, demonstrating heavy influence of horizontal gene transfer events, from Bacteria to trypanosomatid nuclei, in the evolution of these pathways.
Endosymbiosis; Trypanosomatids; Amino acid biosynthesis; Phylogeny; Genomic analyses; Metabolic pathway evolution; Proteobacteria
The shortage of petroleum reserves and the increase in CO2 emissions have raised global concerns and highlighted the importance of adopting sustainable energy sources. Second-generation ethanol made from lignocellulosic materials is considered to be one of the most promising fuels for vehicles. The giant snail Achatina fulica is an agricultural pest whose biotechnological potential has been largely untested. Here, the composition of the microbial population within the crop of this invasive land snail, as well as key genes involved in various biochemical pathways, have been explored for the first time. In a high-throughput approach, 318 Mbp of 454-Titanium shotgun metagenomic sequencing data were obtained. The predominant bacterial phylum found was Proteobacteria, followed by Bacteroidetes and Firmicutes. Viruses, Fungi, and Archaea were present to lesser extents. The functional analysis reveals a variety of microbial genes that could assist the host in the degradation of recalcitrant lignocellulose, detoxification of xenobiotics, and synthesis of essential amino acids and vitamins, contributing to the adaptability and wide-ranging diet of this snail. More than 2,700 genes encoding glycoside hydrolase (GH) domains and carbohydrate-binding modules were detected. When we compared GH profiles, we found an abundance of sequences coding for oligosaccharide-degrading enzymes (36%), very similar to those from wallabies and giant pandas, as well as many novel cellulase and hemicellulase coding sequences, which points to this model as a remarkable potential source of enzymes for the biofuel industry. Furthermore, this work is a major step toward the understanding of the unique genetic profile of the land snail holobiont.
An important virulence factor of Klebsiella pneumoniae is the production of capsular polysaccharide (CPS), a thick mucus layer that allows for evasion of the host's defense and creates a barrier against antibacterial peptides. CPS production is driven mostly by the expression of genes located in a locus called cps, and the resulting structure is used to distinguish between different serotypes (K types). In this study, we report the unique genetic organization of the cps cluster from K. pneumoniae Kp13, a clinical isolate recovered during a large outbreak of nosocomial infections that occurred in a Brazilian teaching hospital.
A pyrosequencing-based approach showed that the cps region of Kp13 (cpsKp13) is 26.4 kbp in length and contains genes common, although not universal, to other strains, such as the rmlBADC operon that codes for L-rhamnose synthesis. cpsKp13 also presents some unique features, like the inversion of the wzy gene and a unique repertoire of glycosyltransferases. In silico comparison of cpsKp13 RFLP pattern with 102 previously published cps PCR-RFLP patterns showed that cpsKp13 is distinct from the C patterns of all other K serotypes. Furthermore, in vitro serotyping showed only a weak reaction with capsular types K9 and K34. We confirm that K9 cps shares common genes with cpsKp13 such as the rmlBADC operon, but lacks features like uge and Kp13-specific glycosyltransferases, while K34 capsules contain three of the five sugars that potentially form the Kp13 CPS.
We report the first description of a cps cluster from a Brazilian clinical isolate of a KPC-producing K. pneumoniae. The gathered data, including K-serotyping, support the conclusion that Kp13’s K-antigen belongs to a novel capsular serotype. The CPS of Kp13 probably includes L-rhamnose and D-galacturonate in its structure, among other residues. Because genes involved in L-rhamnose biosynthesis are absent in humans, this pathway may represent a potential target for the development of antimicrobial agents. Studying the capsular serotypes of clinical isolates is of great importance for further development of vaccines and/or novel therapeutic agents. The distribution of K-types among multidrug-resistant isolates is unknown, but our findings may encourage scientists to perform K-antigen typing of KPC-producing strains worldwide.
Capsular gene cluster; Capsular polysaccharide; K-antigen; KPC-producing K. pneumoniae; Molecular serotyping; Monosaccharide biosynthesis pathways
Although patterns of somatic alterations have been reported for tumor genomes, little is known about how they compare with alterations present in non-tumor genomes. A comparison of the two would be crucial to better characterize the genetic alterations driving tumorigenesis. We sequenced the genomes of a lymphoblastoid (HCC1954BL) and a breast tumor (HCC1954) cell line derived from the same patient and compared the somatic alterations present in both. The lymphoblastoid genome presents a comparable number and similar spectrum of nucleotide substitutions to that found in the tumor genome. However, a significant difference in the ratio of non-synonymous to synonymous substitutions was observed between the two genomes (P = 0.031). Protein–protein interaction analysis revealed that mutations in the tumor genome preferentially affect hub-genes (P = 0.0017) and are co-selected to present synergistic functions (P < 0.0001). KEGG analysis showed that in the tumor genome most mutated genes were organized into signaling pathways related to tumorigenesis. No such organization or synergy was observed in the lymphoblastoid genome. Our results indicate that endogenous mutagens and replication errors can generate the overall number of mutations required to drive tumorigenesis and that it is the combination rather than the frequency of mutations that is crucial to complete tumorigenic transformation.
To identify potential tumor suppressor genes, genome-wide data from exome and transcriptome sequencing were combined to search for genes with loss of heterozygosity and allele-specific expression. The analysis was conducted on the breast cancer cell line HCC1954, and a lymphoblast cell line from the same individual, HCC1954BL.
By comparing exome sequences from the two cell lines, we identified loss of heterozygosity events at 403 genes in HCC1954 and at one gene in HCC1954BL. The combination of exome and transcriptome sequence data also revealed 86 and 50 genes with allele specific expression events in HCC1954 and HCC1954BL, which comprise 5.4% and 2.6% of genes surveyed, respectively. Many of these genes identified by loss of heterozygosity and allele-specific expression are known or putative tumor suppressor genes, such as BRCA1, MSH3 and SETX, which participate in DNA repair pathways.
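The idea behind calling loss of heterozygosity from paired exome data can be sketched as follows: a site that is heterozygous in the matched normal line but homozygous in the tumor line flags its gene. This is a simplified illustration, not the pipeline used in the study. The allele-fraction cutoffs, function names, site coordinates, gene labels, and read counts are all hypothetical.

```python
def is_heterozygous(ref_count, alt_count, low=0.3, high=0.7):
    """Heterozygous if the alternate-allele fraction lies in [low, high]."""
    total = ref_count + alt_count
    return total > 0 and low <= alt_count / total <= high


def is_homozygous(ref_count, alt_count, margin=0.1):
    """Homozygous if nearly all reads support a single allele."""
    total = ref_count + alt_count
    if total == 0:
        return False
    fraction = alt_count / total
    return fraction <= margin or fraction >= 1 - margin


def loh_genes(normal, tumor):
    """Genes with a site heterozygous in normal and homozygous in tumor.

    Both inputs map a site key to (gene, ref_count, alt_count).
    """
    genes = set()
    for site, (gene, n_ref, n_alt) in normal.items():
        if site not in tumor:
            continue
        _, t_ref, t_alt = tumor[site]
        if is_heterozygous(n_ref, n_alt) and is_homozygous(t_ref, t_alt):
            genes.add(gene)
    return sorted(genes)


# Hypothetical read counts at three sites in the two cell lines.
normal = {
    ("chr17", 100): ("GENE_A", 20, 18),
    ("chr5", 200): ("GENE_B", 25, 22),
    ("chr2", 300): ("GENE_C", 40, 0),
}
tumor = {
    ("chr17", 100): ("GENE_A", 0, 35),
    ("chr5", 200): ("GENE_B", 19, 21),
    ("chr2", 300): ("GENE_C", 38, 1),
}

print(loh_genes(normal, tumor))
```

Allele-specific expression can be screened in a similar spirit by comparing allele fractions in transcriptome reads against the heterozygous genotype from the exome, although real analyses also need coverage filters and statistical testing.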
Our results demonstrate that the combined application of high throughput sequencing to exome and allele-specific transcriptome analysis can reveal genes with known tumor suppressor characteristics, and a shortlist of novel candidates for the study of tumor suppressor activities.
The potency of the immune response has still to be harnessed effectively to combat human cancers. However, the discovery of T-cell targets in melanomas and other tumors has raised the possibility that cancer vaccines can be used to induce a therapeutically effective immune response against cancer. The targets, cancer-testis (CT) antigens, are immunogenic proteins preferentially expressed in normal gametogenic tissues and different histological types of tumors. Therapeutic cancer vaccines directed against CT antigens are currently in late-stage clinical trials testing whether they can delay or prevent recurrence of lung cancer and melanoma following surgical removal of primary tumors. CT antigens constitute a large, but ill-defined, family of proteins that exhibit a remarkably restricted expression. Currently, there is a considerable amount of information about these proteins, but the data are scattered through the literature and in several bioinformatic databases. The database presented here, CTdatabase (http://www.cta.lncc.br), unifies this knowledge to facilitate both the mining of the existing deluge of data, and the identification of proteins alleged to be CT antigens, but that do not have their characteristic restricted expression pattern. CTdatabase is more than a repository of CT antigen data, since all the available information was carefully curated and annotated with most data being specifically processed for CT antigens and stored locally. Starting from a compilation of known CT antigens, CTdatabase provides basic information including gene names and aliases, RefSeq accession numbers, genomic location, known splicing variants, gene duplications and additional family members. Gene expression at the mRNA level in normal and tumor tissues has been collated from publicly available data obtained by several different technologies. 
Manually curated data related to mRNA and protein expression, and antigen-specific immune responses in cancer patients are also available, together with links to PubMed for relevant CT antigen articles.