Despite recent progress in the identification of genetic and molecular alterations in prostate cancer, markers associated with tumor progression are scarce. Therefore precise diagnosis of patients and prognosis of the disease remain difficult. This study investigated novel molecular markers discriminating between low and highly aggressive types of prostate cancer.
Using 52 microdissected cell populations of low- and high-risk prostate tumors, we identified almost 1200 genes that were differentially expressed between these groups by global cDNA microarray analysis. These genes were analyzed by statistical, pathway and gene enrichment methods. Twenty selected candidate genes were verified by quantitative real-time PCR and immunohistochemistry. In concordance with the mRNA levels, two genes, MAP3K5 and PDIA3, showed differential protein expression. Functional characterization of PDIA3 revealed a pro-apoptotic role of this gene in PC3 prostate cancer cells.
Our analyses provide deeper insights into the molecular changes occurring during prostate cancer progression. The genes MAP3K5 and PDIA3 are associated with malignant stages of prostate cancer and therefore provide novel potential biomarkers.
Autism and mental retardation (MR) show high rates of comorbidity and potentially share genetic risk factors. In this study, a rare ∼2 Mb microdeletion involving chromosome band 15q13.3 was detected in a multiplex autism family. This genomic loss lies between distal break points of the Prader–Willi/Angelman syndrome locus and was first described in association with MR and epilepsy. Together with recent studies that have also implicated this genomic imbalance in schizophrenia, our data indicate that this CNV shows considerable phenotypic variability. Further studies should aim to characterise the precise phenotypic range of this CNV and may lead to the discovery of genetic or environmental modifiers.
autism; CNV; genetic modifier; learning disability; schizophrenia; phenotypic variability
In breast cancer, overexpression of the transmembrane tyrosine kinase ERBB2 is an adverse prognostic marker and occurs in almost 30% of patients. For therapeutic intervention, ERBB2 is targeted by the monoclonal antibody trastuzumab in adjuvant settings; however, de novo resistance to this antibody is still a serious issue, requiring the identification of additional targets to overcome resistance. In this study, we combined computational simulations, experimental testing of the simulation results, and finally reverse engineering of a protein interaction network to define potential therapeutic strategies for de novo trastuzumab-resistant breast cancer.
First, we employed Boolean logic to model regulatory interactions and simulated single and multiple protein loss-of-functions. Then, our simulation results were tested experimentally by producing single and double knockdowns of the network components and measuring their effects on G1/S transition during cell cycle progression. Combinatorial targeting of ERBB2 and EGFR did not affect the response to trastuzumab in de novo resistant cells, which might be due to decoupling of receptor activation and cell cycle progression. Furthermore, examination of c-MYC in resistant as well as in sensitive cell lines, using a specific chemical inhibitor of c-MYC (alone or in combination with trastuzumab), demonstrated that both trastuzumab sensitive and resistant cells responded to c-MYC perturbation.
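The loss-of-function simulations described above can be illustrated with a minimal Boolean network sketch. The node names, wiring, and update rules below are hypothetical placeholders chosen for illustration; they are not the published network and do not reproduce its results.

```python
# Hypothetical toy rules (NOT the published network): each node's next state
# is a Boolean function of the current state of the whole network.
RULES = {
    "ERBB2": lambda s: s["ERBB2"],             # receptors act as constant inputs
    "EGFR": lambda s: s["EGFR"],
    "AKT": lambda s: s["ERBB2"] or s["EGFR"],
    "ERK": lambda s: s["ERBB2"] or s["EGFR"],
    "MYC": lambda s: s["AKT"] or s["ERK"],
    "G1S": lambda s: s["MYC"],                 # read-out: G1/S transition
}


def simulate(knockouts=(), max_steps=20):
    """Synchronously update the network until a fixed point is reached.

    Nodes listed in `knockouts` are clamped to False (loss of function).
    """
    state = {node: node in ("ERBB2", "EGFR") for node in RULES}
    for node in knockouts:
        state[node] = False
    for _ in range(max_steps):
        nxt = {
            node: (False if node in knockouts else bool(rule(state)))
            for node, rule in RULES.items()
        }
        if nxt == state:  # fixed point reached
            break
        state = nxt
    return state


if __name__ == "__main__":
    for ko in [(), ("ERBB2",), ("ERBB2", "EGFR"), ("MYC",)]:
        print(ko, "-> G1S =", simulate(ko)["G1S"])
```

In this toy wiring a single receptor knockdown leaves the read-out unchanged because the two receptors are redundant, whereas clamping the downstream transcription factor switches it off; a real model would need the curated interactions and experimentally validated rules.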
In this study, we connected ERBB signaling with G1/S transition of the cell cycle via two major cell signaling pathways and two key transcription factors, to model an interaction network that allows for the identification of novel targets in the treatment of trastuzumab resistant breast cancer. Applying this new strategy, we found that, in contrast to trastuzumab sensitive breast cancer cells, combinatorial targeting of ERBB receptors or of key signaling intermediates does not have potential for treatment of de novo trastuzumab resistant cells. Instead, c-MYC was identified as a novel potential target protein in breast cancer cells.
Motivation: Functional characterization of genes is of great importance for the understanding of complex cellular processes. Valuable information for this purpose can be obtained from pathway databases, like KEGG. However, only a small fraction of genes is annotated with pathway information up to now. In contrast, information on contained protein domains can be obtained for a significantly higher number of genes, e.g. from the InterPro database.
Results: We present a classification model that, for a specific gene of interest, predicts the mapping to a KEGG pathway based on the gene's domain signature. The classifier makes explicit use of the hierarchical organization of pathways in the KEGG database. Furthermore, we take into account that a specific gene can be mapped to different pathways at the same time. The classification method produces a scoring of all possible mapping positions of the gene in the KEGG hierarchy. Evaluations of our model, which combines an SVM and a ranking perceptron approach, show a high prediction performance. Moreover, for signaling pathways we show that it is even possible to accurately predict membership in individual pathway components.
Availability: The R package gene2pathway is a supplement to this article.
Supplementary Information: Supplementary data are available at Bioinformatics online.
Motivation: Targeted interventions using RNA interference in combination with the measurement of secondary effects with DNA microarrays can be used to computationally reverse engineer features of upstream non-transcriptional signaling cascades based on the nested structure of effects.
Results: We extend previous work by Markowetz et al., who proposed a statistical framework to score different network hypotheses. Our extensions go in several directions: we show how prior assumptions on the network structure can be incorporated into the scoring scheme by defining appropriate prior distributions on the network structure as well as on hyperparameters. An approach called module networks is introduced to scale up the original approach, which is limited to around 5 genes, to infer large-scale networks of more than 30 genes. Instead of the data discretization step needed in the original framework, we propose the usage of a beta-uniform mixture distribution on the P-value profile, resulting from differential gene expression calculation, to quantify effects. Extensive simulations on artificial data and application of our module network approach to infer the signaling network between 13 genes in the ER-α pathway in human MCF-7 breast cancer cells show that our approach gives sensible results. Using a bootstrapping and a jackknife approach, this reconstruction is found to be statistically stable.
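To make the idea of quantifying effects from a P-value profile concrete, the following sketch uses the common two-component beta-uniform mixture, f(p) = λ + (1 − λ)·a·p^(a−1) with 0 < a < 1. This parametrization is an assumption for illustration; the abstract does not state the exact mixture used, and the function names and parameter values are hypothetical.

```python
def bum_density(p, lam, a):
    """Density of a two-component beta-uniform mixture at P-value p.

    lam: weight of the uniform (no effect) component, 0 <= lam <= 1.
    a:   shape of the Beta(a, 1) component, 0 < a < 1 (mass near p = 0).
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    return lam + (1.0 - lam) * a * p ** (a - 1.0)


def effect_probability(p, lam, a):
    """Posterior probability that p was drawn from the Beta (effect) component."""
    alt = (1.0 - lam) * a * p ** (a - 1.0)
    return alt / (lam + alt)


if __name__ == "__main__":
    # Illustrative parameters only; in practice they are fitted to the data.
    for p in (0.001, 0.01, 0.1, 0.9):
        print(p, round(effect_probability(p, lam=0.5, a=0.5), 3))
```

Small P-values receive a high effect probability and large ones a low probability, which yields a continuous effect score without a hard discretization threshold.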
Availability: The proposed method is available within the Bioconductor R-package nem.
An arbitrary set of 96 human proteins was selected and tested to set up a fully automated protein production strategy, covering all steps from DNA preparation to protein purification and analysis. The target proteins are encoded by functionally uncharacterized open reading frames (ORFs) identified by the German cDNA Consortium. Fusion proteins were produced in E. coli with four different fusion tags and tested in five different purification strategies depending on the respective fusion tag. The automated strategy relies on standard liquid handling and clone picking equipment.
A robust automated strategy for the production of recombinant human proteins in E. coli was established based on a set of four different protein expression vectors resulting in NusA/His, MBP/His, GST and His-tagged proteins. The yield of soluble fusion protein was correlated with the induction temperature and the respective fusion tag. NusA/His and MBP/His fusion proteins are best expressed at low temperature (25°C), whereas the yield of soluble GST fusion proteins was higher when protein expression was induced at elevated temperature. In contrast, the induction of soluble His-tagged fusion proteins was independent of the temperature. Amylose was not found useful for affinity-purification of MBP/His fusion proteins in a high-throughput setting, and metal chelating chromatography is recommended instead.
Soluble fusion proteins can be produced in E. coli in sufficient quality and in μg/ml culture quantities for downstream applications such as microarray-based assays and studies on protein-protein interactions, employing a fully automated protein expression and purification strategy. Future applications might include the optimization of experimental conditions for the large-scale production of soluble recombinant proteins from libraries of open reading frames.
High-throughput technologies like functional screens and gene expression analysis produce extended lists of candidate genes. Gene-Set Enrichment Analysis is a commonly used and well-established technique to test for the statistically significant over-representation of particular pathways. A shortcoming of this method, however, is that most genes investigated in the experiments have very sparse functional or pathway annotation and therefore cannot be the target of such an analysis. The approach presented here aims to assign lists of genes with limited annotation to previously described functional gene collections or pathways. This works by comparing InterPro domain signatures of the candidate gene lists with domain signatures of gene sets derived from known classifications, e.g. KEGG pathways.
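A minimal sketch of comparing domain signatures is given below. It assumes that a signature is the relative frequency of each domain across a gene list and that two signatures are compared by cosine similarity; both choices, as well as the gene and domain identifiers, are illustrative assumptions and not necessarily what the described package does.

```python
from collections import Counter
from math import sqrt


def domain_signature(genes, gene_to_domains):
    """Relative frequency of each domain over a list of genes."""
    counts = Counter()
    for gene in genes:
        counts.update(set(gene_to_domains.get(gene, ())))
    total = sum(counts.values())
    return {dom: c / total for dom, c in counts.items()} if total else {}


def cosine_similarity(sig_a, sig_b):
    """Cosine similarity between two sparse signature vectors."""
    dot = sum(v * sig_b.get(dom, 0.0) for dom, v in sig_a.items())
    norm_a = sqrt(sum(v * v for v in sig_a.values()))
    norm_b = sqrt(sum(v * v for v in sig_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


if __name__ == "__main__":
    # Placeholder identifiers, not real InterPro accessions.
    annotation = {"geneA": ["DOM1", "DOM2"], "geneB": ["DOM1"], "geneC": ["DOM3"]}
    candidates = domain_signature(["geneA", "geneB"], annotation)
    pathway = domain_signature(["geneA"], annotation)
    print(cosine_similarity(candidates, pathway))
```

A significance assessment, for example by comparing the observed similarity against similarities of randomly drawn gene lists, would be needed on top of this raw score.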
In order to validate our approach, we designed a simulation study. Based on all pathways available in the KEGG database, we create test gene lists by randomly selecting pathway genes, removing these genes from the known pathways and adding variable amounts of noise in the form of genes not annotated to the pathway. We show that we can recover pathway memberships based on the simulated gene lists with high accuracy. We further demonstrate the applicability of our approach on a biological example.
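The construction of a simulated test list described above can be sketched as follows. The function name, arguments, and the use of a fixed random seed are assumptions made for illustration only.

```python
import random


def make_test_list(pathway_genes, background_genes, n_pathway, n_noise, seed=0):
    """Build one simulated gene list for a single pathway.

    Returns (test_list, remaining_pathway):
      test_list         - held-out pathway genes followed by noise genes
      remaining_pathway - the pathway with the held-out genes removed
    """
    rng = random.Random(seed)
    pathway = set(pathway_genes)
    # Randomly select pathway genes and remove them from the known pathway.
    held_out = rng.sample(sorted(pathway), n_pathway)
    # Noise: genes that are not annotated to this pathway.
    noise_pool = sorted(set(background_genes) - pathway)
    noise = rng.sample(noise_pool, n_noise)
    return held_out + noise, pathway - set(held_out)


if __name__ == "__main__":
    pathway = {f"g{i}" for i in range(1, 11)}
    background = {f"g{i}" for i in range(1, 31)}
    test_list, remaining = make_test_list(pathway, background, n_pathway=4, n_noise=2)
    print(test_list, sorted(remaining))
```

Repeating this for every pathway and for increasing values of `n_noise` gives a set of lists on which the recovery of the original pathway membership can be scored.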
Results based on simulation and data analysis show that domain based pathway enrichment analysis is a very sensitive method to test for enrichment of pathways in sparsely annotated lists of genes. An R based software package domainsignatures, to routinely perform this analysis on the results of high-throughput screening, is available via Bioconductor.
With the completion of the human genome sequence, the functional analysis and characterization of the encoded proteins has become the next urgent challenge in the post-genome era. The lack of comprehensive ORFeome resources has thus far hampered systematic applications by protein gain-of-function analysis. Gene and ORF coverage with full-length ORF clones thus needs to be extended. In combination with a unique and versatile cloning system, these clones will provide the tools for genome-wide systematic functional analyses and a deeper insight into complex biological processes.
Here we describe the generation of a full-ORF clone resource of human genes applying the Gateway cloning technology (Invitrogen). A pipeline for efficient cloning and sequencing was developed and a sample tracking database was implemented to streamline the clone production process targeting more than 2,200 different ORFs. In addition, a robust cloning strategy was established, permitting the simultaneous generation of two clone variants that contain a particular ORF with as well as without a stop codon by the implementation of only one additional working step into the cloning procedure. Up to 92 % of the targeted ORFs were successfully amplified by PCR and more than 93 % of the amplicons successfully cloned.
The German cDNA Consortium ORFeome resource currently consists of more than 3,800 sequence-verified entry clones representing ORFs, cloned with and without stop codon, for about 1,700 different gene loci. 177 splice variants were cloned representing 121 of these genes. The entry clones have been used to generate over 5,000 different expression constructs, providing the basis for functional profiling applications. As a member of the recently formed international ORFeome collaboration, we substantially contribute to generating and providing a whole-genome human ORFeome collection in a unique cloning system that is made freely available to the community.
The advent of RNA interference techniques enables the selective silencing of biologically interesting genes in an efficient way. In combination with DNA microarray technology this enables researchers to gain insights into signaling pathways by observing downstream effects of individual knock-downs on gene expression. These secondary effects can be used to computationally reverse engineer features of the upstream signaling pathway.
In this paper we address this challenging problem by extending previous work by Markowetz et al., who proposed a statistical framework to score network hypotheses in a Bayesian manner. Our extensions go in three directions: First, we introduce a way to omit the data discretization step needed in the original framework via a calculation based on p-values instead. Second, we show how prior assumptions on the network structure can be incorporated into the scoring scheme using regularization techniques. Third and most important, we propose methods to scale up the original approach, which is limited to around 5 genes, to large-scale networks.
Comparisons of these methods on artificial data are conducted. Our proposed module network is employed to infer the signaling network between 13 genes in the ER-α pathway in human MCF-7 breast cancer cells. Using a bootstrapping approach, this reconstruction is found to have good statistical stability.
The code for the module network inference method is available in the latest version of the R-package nem, which can be obtained from the Bioconductor homepage.
Deleted in Malignant Brain Tumors 1 (DMBT1) is a secreted scavenger receptor cysteine-rich protein that binds various bacteria and is thought to participate in innate pulmonary host defense. We hypothesized that pulmonary DMBT1 could contribute to respiratory distress syndrome in neonates by modulating surfactant function.
DMBT1 expression was studied by immunohistochemistry and mRNA in situ hybridization in post-mortem lungs of preterm and full-term neonates with pulmonary hyaline membranes. The effect of human recombinant DMBT1 on the function of bovine and porcine surfactant was measured by a capillary surfactometer. DMBT1-levels in tracheal aspirates of ventilated preterm and term infants were determined by ELISA.
Pulmonary DMBT1 was localized in hyaline membranes during respiratory distress syndrome. In vitro addition of human recombinant DMBT1 to the surfactants increased surface tension in a dose-dependent manner. The DMBT1-mediated effect was reverted by the addition of calcium depending on the surfactant preparation.
Our data showed pulmonary DMBT1 expression in hyaline membranes during respiratory distress syndrome and demonstrated that DMBT1 increases lung surface tension in vitro. This raises the possibility that DMBT1 could antagonize surfactant supplementation in respiratory distress syndrome and could represent a candidate target molecule for therapeutic intervention in neonatal lung disease.
With the increased availability of high-throughput data, such as DNA microarray data, researchers are capable of producing large amounts of biological data. During the analysis of such data, there is often a need to further explore the similarity of genes not only with respect to their expression, but also with respect to their functional annotation, which can be obtained from the Gene Ontology (GO).
We present the freely available software package GOSim, which calculates the functional similarity of genes based on various information-theoretic similarity concepts for GO terms. GOSim extends existing tools by providing additional, recently developed functional similarity measures for genes. These can, for example, be used to cluster genes according to their biological function. Conversely, they can also be used to evaluate the homogeneity of a given grouping of genes with respect to their GO annotation. GOSim hence provides the researcher with a flexible and powerful tool to combine knowledge stored in GO with experimental data. It can be seen as complementary to other tools that, for instance, search for significantly overrepresented GO terms within a given group of genes.
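As an illustration of what an information-theoretic similarity between GO terms looks like, the sketch below computes the Resnik and Lin measures on a toy ontology. The term names, graph, and annotation probabilities are invented for the example, and the abstract does not state which measures the package implements or how, so this should be read as a generic sketch rather than a description of GOSim itself.

```python
from math import log

# Toy ontology (hypothetical): term -> list of direct parents.
PARENTS = {"root": [], "A": ["root"], "B": ["root"], "A1": ["A"], "A2": ["A"]}
# Hypothetical probability that a gene is annotated with a term or its descendants.
PROB = {"root": 1.0, "A": 0.5, "B": 0.5, "A1": 0.25, "A2": 0.25}


def ancestors(term):
    """Return the term itself plus all of its ancestors."""
    seen, stack = {term}, [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(term):
    """IC(t) = -log p(t); rare terms are more informative."""
    return -log(PROB[term])


def resnik(t1, t2):
    """IC of the most informative common ancestor."""
    return max(information_content(t) for t in ancestors(t1) & ancestors(t2))


def lin(t1, t2):
    """Resnik similarity normalised by the IC of both terms (range 0..1)."""
    denom = information_content(t1) + information_content(t2)
    return 2.0 * resnik(t1, t2) / denom if denom > 0 else 0.0


if __name__ == "__main__":
    print(resnik("A1", "A2"), lin("A1", "A2"), resnik("A1", "B"))
```

Gene-level similarity is then typically derived by aggregating term-level scores over all pairs of terms annotated to the two genes, for example by taking the maximum or an average.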
GOSim is implemented as a package for the statistical computing environment R and is distributed under GPL within the CRAN project.
The German cDNA Consortium has been cloning full length cDNAs and continued with their exploitation in protein localization experiments and cellular assays. However, the efficient use of large cDNA resources requires the development of strategies that are capable of a speedy selection of truly useful cDNAs from biological and experimental noise. To this end we have developed a new high-throughput analysis tool, CAFTAN, which simplifies these efforts and thus fills the gap between large-scale cDNA collections and their systematic annotation and application in functional genomics.
CAFTAN is built around the mapping of cDNAs to the genome assembly and the subsequent analysis of their genomic context. It uses sequence features such as the presence and type of polyA signals, inner and flanking repeats, the GC content, and splice site types. All these features are evaluated in individual tests and are used to classify cDNAs according to their sequence quality and their likelihood of having been generated from fully processed mRNAs. Additionally, CAFTAN compares the coordinates of mapped cDNAs with the genomic coordinates of reference sets from publicly available resources (e.g., VEGA, ENSEMBL). This provides detailed information about overlapping exons and the structural classification of cDNAs with respect to the reference set of splice variants.
The evaluation of CAFTAN showed that it is able to correctly classify more than 85% of 5950 selected "known protein-coding" VEGA cDNAs as high-quality multi- or single-exon cDNAs. It identified as good 80.6 % of the single exon cDNAs and 85 % of the multiple exon cDNAs.
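Three of the feature tests mentioned above can be sketched in a few lines. The tool itself is written in Perl; the Python functions below are illustrative stand-ins with hypothetical names, a hypothetical search window, and inclusive coordinates assumed, and they only check for the canonical AATAAA polyadenylation signal.

```python
def gc_content(seq):
    """Fraction of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def has_polya_signal(seq, window=50, signal="AATAAA"):
    """True if the canonical polyA signal occurs in the last `window` bases."""
    return signal in seq.upper()[-window:]


def overlapping_exons(cdna_exons, ref_exons):
    """Pairs of (cDNA exon, reference exon) whose genomic intervals overlap.

    Exons are (start, end) tuples with inclusive coordinates on the same strand.
    """
    return [
        (c, r)
        for c in cdna_exons
        for r in ref_exons
        if c[0] <= r[1] and r[0] <= c[1]
    ]


if __name__ == "__main__":
    print(gc_content("GGCCAATT"))
    print(has_polya_signal("CCCCAATAAAGGGG"))
    print(overlapping_exons([(100, 200), (300, 400)], [(150, 250), (500, 600)]))
```

A classifier in this spirit would combine the outcomes of many such tests into a quality class for each cDNA.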
The program is written in Perl and in a modular way, allowing the adoption of this strategy to other tasks like EST-annotation, or to extend it by adding new classification rules and new organism databases as they become available. We think that it is a very useful program for the annotation and research of unfinished genomes.
CAFTAN is a high-throughput sequence analysis tool, which performs a fast and reliable quality prediction of cDNAs. Several thousands of cDNAs can be analyzed in a short time, giving the curator/scientist a first quick overview about the quality and the already existing annotation of a set of cDNAs. It supports the rejection of low quality cDNAs and helps in the selection of likely novel splice variants, and/or completely novel transcripts for new experiments.
A software tool for the analysis of high-throughput cell-based assays is presented.
High-throughput cell-based assays with flow cytometric readout provide a powerful technique for identifying components of biological pathways and their interactors. Interpretation of these large datasets requires effective computational methods. We present a new approach that includes data pre-processing, visualization, quality assessment, and statistical inference. The software is freely available in the Bioconductor package prada. The method permits analysis of large screens to detect the effects of molecular interventions in cellular systems.
The identification of patterns in biological sequences is a key challenge in genome analysis and in proteomics. Frequently such patterns are complex and highly variable, especially in protein sequences. They are frequently described using terms of regular expressions (RegEx) because of the user-friendly terminology. Limitations arise for queries with the increasing complexity of patterns and are accompanied by requirements for enhanced capabilities. This is especially true for patterns containing ambiguous characters and positions and/or length ambiguities.
We have implemented the 3of5 web application in order to enable complex pattern matching in protein sequences. 3of5 is named after a special use of its main feature, the novel n-of-m pattern type. This feature allows for an extensive specification of variable patterns where the individual elements may vary in their position, order, and content within a defined stretch of sequence. The number of distinct elements can be constrained by operators, and individual characters may be excluded. The n-of-m pattern type can be combined with common regular expression terms and thus also allows for a comprehensive description of complex patterns. 3of5 increases the fidelity of pattern matching and finds ALL possible solutions in protein sequences in cases of length-ambiguous patterns instead of simply reporting the longest or shortest hits. Grouping and combined search for patterns provide a hierarchical arrangement of larger pattern sets. The algorithm is implemented as an internet application and is freely accessible. The application is available at .
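The two ideas above, an n-of-m element and the reporting of all solutions for length-ambiguous patterns, can be approximated with the sketch below. The semantics assumed here (at least n residues from an allowed set within a window of m positions) and the function names are hypothetical; the actual pattern syntax and algorithm of the application are not given in this abstract.

```python
import re


def n_of_m_matches(seq, allowed, n, m):
    """All windows of length m containing at least n residues from `allowed`.

    Returns a list of (start_index, window) tuples, overlapping hits included.
    """
    hits = []
    for i in range(len(seq) - m + 1):
        window = seq[i:i + m]
        if sum(ch in allowed for ch in window) >= n:
            hits.append((i, window))
    return hits


def all_matches(seq, pattern, max_len):
    """Every substring (up to max_len) that fully matches a regular expression.

    Unlike a single greedy or lazy regex search, this reports all solutions
    of a length-ambiguous pattern, at the cost of checking every substring.
    """
    compiled = re.compile(pattern)
    return [
        (i, seq[i:j])
        for i in range(len(seq))
        for j in range(i + 1, min(len(seq), i + max_len) + 1)
        if compiled.fullmatch(seq[i:j])
    ]


if __name__ == "__main__":
    # At least 3 basic residues (K or R) within any stretch of 5.
    print(n_of_m_matches("AKRKAAAA", set("KR"), n=3, m=5))
    # Both the short and the long solution of K.{1,3}R are reported.
    print(all_matches("KARAR", r"K.{1,3}R", max_len=5))
```

The exhaustive substring check is quadratic in sequence length and is only meant to show the behaviour; a production matcher would use a more efficient search.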
The 3of5 application offers an extended vocabulary for the definition of search patterns and thus allows the user to comprehensively specify and identify peptide patterns with variable elements. The n-of-m pattern type offers an improved accuracy for pattern matching in combination with the ability to find all solutions, without compromising the user friendliness of regular expression terms.
Well known for its gene density and the large number of mapped diseases, the human sub-chromosomal region Xq28 has long been a focus of genome research. Over 40 of approximately 300 X-linked diseases map to this region, and systematic mapping, transcript identification, and mutation analysis have led to the identification of causative genes for 26 of these diseases, leaving another 17 diseases mapped to Xq28 for which the causative gene is still unknown. To expedite disease gene identification, we have initiated the functional characterisation of all known Xq28 genes.
Using a systematic approach, we describe the Xq28 genes by RNA in situ hybridisation and Northern blotting of the mouse orthologs, as well as by subcellular localisation and data mining of the human genes. We have developed a relational, web-accessible database with comprehensive query options that integrates all experimental data. Using this database, we matched gene expression patterns with affected tissues for 16 of the 17 remaining Xq28-linked diseases for which the causative gene is unknown.
By using this systematic approach, we have prioritised genes in linkage regions of Xq28-mapped diseases to an amenable number for mutational screens. Our database can be queried by any researcher performing highly specified searches including diseases not listed in OMIM or diseases that might be linked to Xq28 in the future.
LIFEdb () integrates data from large-scale functional genomics assays and manual cDNA annotation with bioinformatics gene expression and protein analysis. New features of LIFEdb include (i) an updated user interface with enhanced query capabilities, (ii) a configurable output table and the option to download search results in XML, (iii) the integration of data from cell-based screening assays addressing the influence of protein-overexpression on cell proliferation and (iv) the display of the relative expression (‘Electronic Northern’) of the genes under investigation using curated gene expression ontology information. LIFEdb enables researchers to systematically select and characterize genes and proteins of interest, and presents data and information via its user-friendly web-based interface.
Given the complexity of higher organisms, the number of genes encoded by their genomes is surprisingly small. Tissue specific regulation of expression and splicing are major factors enhancing the number of the encoded products. Commonly these mechanisms are intragenic and affect only one gene.
Here we provide evidence that the IL4I1 gene is specifically transcribed from the apparent promoter of the upstream NUP62 gene, and that the first two exons of NUP62 are also contained in the novel IL4I1_2 variant. While expression of IL4I1 driven from its previously described promoter is found mostly in B cells, the expression driven by the NUP62 promoter is restricted to cells in testis (Sertoli cells) and in the brain (e.g., Purkinje cells). Since NUP62 is itself ubiquitously expressed, the IL4I1_2 variant likely derives from cell type specific alternative pre-mRNA processing.
Comparative genomics suggest that the promoter upstream of the NUP62 gene originally belonged to the IL4I1 gene and was later acquired by NUP62 via insertion of a retroposon. Since both genes are apparently essential, the promoter had to serve two genes afterwards. Expression of the IL4I1 gene from the "NUP62" promoter and the tissue specific involvement of the pre-mRNA processing machinery to regulate expression of two unrelated proteins indicate a novel mechanism of gene regulation.
cDNA libraries are widely used to identify genes and splice variants, and as a physical resource for full-length clones. Conventionally-generated cDNA libraries contain a high percentage of 5'-truncated clones. Current library construction methods that enrich for full-length mRNA are laborious, and involve several enzymatic steps performed on mRNA, which renders them sensitive to RNA degradation. The SMART technique for full-length enrichment is robust but results in limited cDNA insert size of the library.
We describe a method to construct SMART full-length enriched cDNA libraries with large insert sizes. Sub-libraries were generated from size-fractionated cDNA with an average insert size of up to seven kb. The percentage of full-length clones was calculated for different size ranges from BLAST results of over 12,000 5'ESTs.
The presented technique is suitable to generate full-length enriched cDNA libraries with large average insert sizes in a straightforward and robust way. The representation of full-coding clones is high also for large cDNAs (70%, 4–10 kb), when high-quality starting mRNA is used.
The requirement of a large amount of high-quality RNA is a major limiting factor for microarray experiments using biopsies. An average microarray experiment requires 10–100 μg of RNA. However, due to their small size, most biopsies do not yield this amount. Several different approaches for RNA amplification in vitro have been described and applied for microarray studies. In most of these, systematic analyses of the potential bias introduced by the enzymatic modifications are lacking.
We examined the sources of error introduced by the T7 RNA polymerase based RNA amplification method through hybridisation studies on microarrays and performed statistical analysis of the parameters that need to be evaluated prior to routine laboratory use. The results demonstrate that amplification of the RNA has no systematic influence on the outcome of the microarray experiment. Although variations in differential expression between amplified and total RNA hybridisations can be observed, RNA amplification is reproducible, and there is no evidence that it introduces a large systematic bias.
Our results underline the utility of the T7 based RNA amplification for use in microarray experiments provided that all samples under study are equally treated.
The wealth of transcript information that has been made publicly available in recent years requires the development of high-throughput functional genomics and proteomics approaches for its analysis. Such approaches need suitable data integration procedures and a high level of automation in order to gain maximum benefit from the results generated. We have designed an automatic pipeline to analyse annotated open reading frames (ORFs) stemming from full-length cDNAs produced mainly by the German cDNA Consortium. The ORFs are cloned into expression vectors for use in large-scale assays such as the determination of subcellular protein localization or kinase reaction specificity. Additionally, all identified ORFs undergo exhaustive bioinformatic analysis such as similarity searches, protein domain architecture determination and prediction of physicochemical characteristics and secondary structure, using a wide variety of bioinformatic methods in combination with the most up-to-date public databases (e.g. PRINTS, BLOCKS, INTERPRO, PROSITE, SWISSPROT). Data from experimental results and from the bioinformatic analysis are integrated and stored in a relational database (MS SQL-Server), which makes it possible for researchers to find answers to biological questions easily, thereby speeding up the selection of targets for further analysis. The designed pipeline constitutes a new automatic approach to obtaining and administrating relevant biological data from high-throughput investigations of cDNAs in order to systematically identify and characterize novel genes, as well as to comprehensively describe the function of the encoded proteins.
We have implemented LIFEdb (http://www.dkfz.de/LIFEdb) to link information regarding novel human full-length cDNAs generated and sequenced by the German cDNA Consortium with functional information on the encoded proteins produced in functional genomics and proteomics approaches. The database also serves as a sample-tracking system to manage the process from cDNA to experimental read-out and data interpretation. A web interface enables the scientific community to explore and visualize features of the annotated cDNAs and ORFs combined with experimental results, and thus helps to unravel new features of proteins with as yet unknown functions.