More reliable and faster prediction methods are needed to interpret enormous amounts of data generated by sequencing and genome projects. We have developed a new computational tool, PON-P2, for classification of amino acid substitutions in human proteins. The method is a machine learning-based classifier and groups the variants into pathogenic, neutral and unknown classes on the basis of a random forest probability score. PON-P2 is trained using pathogenic and neutral variants obtained from VariBench, a database for benchmark variation datasets. PON-P2 utilizes information about evolutionary conservation of sequences, physical and biochemical properties of amino acids, GO annotations and, if available, functional annotations of variation sites. Extensive feature selection was performed to identify 8 informative features among altogether 622 features. PON-P2 consistently showed superior performance in comparison to existing state-of-the-art tools. In the 10-fold cross-validation test, its accuracy and MCC are 0.90 and 0.80, respectively, and in the independent test, they are 0.86 and 0.71, respectively. The coverage of PON-P2 is 61.7% in the 10-fold cross-validation and 62.1% in the test dataset. PON-P2 is a powerful tool for screening harmful variants and for ranking and prioritizing experimental characterization. It is very fast, making it capable of analyzing large variant datasets. PON-P2 is freely available at http://structure.bmc.lu.se/PON-P2/.
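As an illustration of the three-class output described above, the following minimal Python sketch maps a probability score to pathogenic, neutral or unknown, and computes coverage under the assumption that coverage means the fraction of variants that receive a definite class. The cut-offs `lower` and `upper`, the function names and the example scores are hypothetical and chosen only for illustration; the abstract does not state how PON-P2 derives the unknown class from the random forest score.

```python
from typing import Iterable, List


def classify_variant(p_pathogenic: float, lower: float = 0.3, upper: float = 0.7) -> str:
    """Map a probability of pathogenicity to one of three classes.

    The thresholds are hypothetical placeholders, not PON-P2 parameters.
    """
    if not 0.0 <= p_pathogenic <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if p_pathogenic >= upper:
        return "pathogenic"
    if p_pathogenic <= lower:
        return "neutral"
    # Scores between the two cut-offs are left unclassified.
    return "unknown"


def coverage(labels: Iterable[str]) -> float:
    """Fraction of variants assigned a definite class (assumed definition)."""
    labels = list(labels)
    if not labels:
        raise ValueError("no labels given")
    decided = sum(1 for label in labels if label != "unknown")
    return decided / len(labels)


# Made-up scores for five variants.
scores: List[float] = [0.95, 0.10, 0.55, 0.72, 0.40]
labels = [classify_variant(s) for s in scores]
print(labels, coverage(labels))
```

Widening the band between `lower` and `upper` trades coverage for reliability: fewer variants are classified, but those that are classified lie further from the decision boundary.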
Imagine if we could compute across phenotype data as easily as genomic data; this article calls for efforts to realize this vision and discusses the potential benefits.
Despite a large and multifaceted effort to understand the vast landscape of phenotypic data, their current form inhibits productive data analysis. The lack of a community-wide, consensus-based, human- and machine-interpretable language for describing phenotypes and their genomic and environmental contexts is perhaps the most pressing scientific bottleneck to integration across many key fields in biology, including genomics, systems biology, development, medicine, evolution, ecology, and systematics. Here we survey the current phenomics landscape, including data resources and handling, and the progress that has been made to accurately capture relevant data descriptions for phenotypes. We present an example of the kind of integration across domains that computable phenotypes would enable, and we call upon the broader biology community, publishers, and relevant funding agencies to support efforts to surmount today's data barriers and facilitate analytical reproducibility.
Systematic representation of information related to genetic and non-genetic variations is required to allow large scale studies, data mining and data integration, and to make it possible to reveal novel relationships between genotype and phenotype. Although a large amount of variation data is available, it is often difficult to use because of a lack of systematics.
A novel ontology, Variation Ontology (VariO, http://variationontology.org), was developed for annotation of effects, consequences and mechanisms of variations. In this article instructions are provided on how VariO annotations are made. The major levels for description are the three molecules, namely DNA, RNA and protein. They are further divided into four major sublevels: variation type, function, structure, and property, and further into up to eight sublevels. VariO annotation summarizes existing knowledge about a variation and its effects and formalizes it so that computational analyses are efficient. The annotations should be made on as many levels as possible. VariO annotations are made in reference to normal states, which vary for each data item including e.g. reference sequences, wild type properties, and activities.
Detailed instructions together with examples are provided to indicate how VariO can be used for annotation of variations and their effects. A dedicated tool has been developed for annotation and will be further developed to cover also evidence for the annotations. VariO is suitable for annotation of data in many types of databases. As several different kinds of databases are in the process of adopting VariO annotations, it is important to have guidelines to guarantee consistent annotation.
Variation ontology; Annotation instructions; Systematics; Variation effects; Mutation; Ontology
Data-driven studies on the dynamics of reconstructed protein-protein interaction (PPI) networks facilitate investigation and identification of proteins important for particular processes or diseases and reduce the time and costs of experimental verification. Modeling the dynamics of very large PPI networks is computationally costly.
To circumvent this problem, we created a link-weighted human immunome interactome and performed filtering. We reconstructed the immunome interactome and weighted the links using jackknife gene expression correlation of integrated, time course gene expression data. Statistical significance of the links was computed using the Global Statistical Significance (GloSS) filtering algorithm. P-values from GloSS were computed for the integrated, time course gene expression data. We filtered the immunome interactome to identify core components of the T cell PPI network (TPPIN). The interconnectedness of the major pathways for T cell survival and response, including the T cell receptor, MAPK and JAK-STAT pathways, is maintained in the TPPIN network. The obtained TPPIN network is supported by both Gene Ontology term enrichment analysis and essential gene enrichment analysis.
By integrating gene expression data into the immunome interactome and using a weighted network filtering method, we identified the T cell PPI immune response network. This network reveals the most central and crucial network in T cells. The approach is general and applicable to any dataset that contains sufficient information.
Protein-protein interaction; Network; Filtering; T cell; TPPIN; Signaling; PPI
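The link weighting by jackknife correlation mentioned in this abstract can be sketched as follows. The sketch assumes one common definition, in which the jackknife correlation of two expression profiles is the minimum of the Pearson correlations obtained by leaving out each sample in turn; whether the study used exactly this variant is not stated in the abstract. The function names and the toy profiles are hypothetical.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 for a constant profile (a convention chosen here)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def jackknife_correlation(x: Sequence[float], y: Sequence[float]) -> float:
    """Minimum leave-one-out Pearson correlation (assumed definition)."""
    xs: List[float] = list(x)
    ys: List[float] = list(y)
    if len(xs) != len(ys) or len(xs) < 4:
        raise ValueError("need two profiles of equal length with at least 4 samples")
    leave_one_out = [
        pearson(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        for i in range(len(xs))
    ]
    return min(leave_one_out)


# Toy profiles: the last sample is a shared outlier that inflates the plain correlation.
gene_a = [1.0, 2.0, 3.0, 4.0, 10.0]
gene_b = [4.0, 1.0, 3.0, 2.0, 10.0]
print(pearson(gene_a, gene_b), jackknife_correlation(gene_a, gene_b))
```

In this toy case the plain correlation is high only because of one sample, and the jackknife value exposes that, which is the motivation for using it as a link weight that is less sensitive to single outlying measurements.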
B cells play a pivotal role in the adaptive immune system, since they maintain a delicate balance between recognition and clearance of foreign pathogens and tolerance to self. During maturation, B cells progress through a series of developmental stages defined by specific phenotypic surface markers and the rearrangement and expression of immunoglobulin (Ig) genes. To gain insight into the B cell proteome during the maturation pathway, we studied differential protein expression in eight human cell lines, which cover four distinctive developmental stages: early pre-B, pre-B, plasma cell and immature B cell upon anti-IgM stimulation. Our two-dimensional differential gel electrophoresis (2D-DIGE) and mass spectrometry based proteomic study indicates the involvement of a large number of proteins with various functions. Notably, proteins related to cytoskeleton were relatively highly expressed in early pre-B and pre-B cells, whereas the plasma cell proteome contained endoplasmic reticulum and Golgi system proteins. Our long time series analysis in anti-IgM stimulated Ramos B cells revealed the dynamic regulation of cytoskeleton organization, gene expression and metabolic pathways, among others. The findings are related to cellular processes in B cells and are discussed in relation to experimental information for the proteins and pathways they are involved in. Representative 2D-DIGE maps of different B cell maturation stages are available online at http://structure.bmc.lu.se/BcellProteome/.
1α,25-Dihydroxyvitamin D3 (1α,25(OH)2D3) had earlier been regarded as the only active hormone. The newly identified actions of 25-hydroxyvitamin D3 (25(OH)D3) and 24R,25-dihydroxyvitamin D3 (24R,25(OH)2D3) broadened the vitamin D3 endocrine system; however, the current data are fragmented and a systematic understanding is lacking. Here we performed the first systematic study of global gene expression to clarify their similarities and differences. The three metabolites at physiologically comparable levels were utilized to treat human and mouse fibroblasts prior to DNA microarray analyses. Human primary prostate stromal P29SN cells (hP29SN), which convert 25(OH)D3 into 1α,25(OH)2D3 by 1α-hydroxylase (encoded by the gene CYP27B1), displayed regulation of 164, 171, and 175 genes by treatment with 1α,25(OH)2D3, 25(OH)D3, and 24R,25(OH)2D3, respectively. Mouse primary Cyp27b1 knockout fibroblasts (mCyp27b1−/−), which lack 1α-hydroxylation, displayed regulation of 619, 469, and 66 genes using the same respective treatments. The number of shared genes regulated by two metabolites is much lower in hP29SN than in mCyp27b1−/−. By using DAVID Functional Annotation Bioinformatics Microarray Analysis tools and Ingenuity Pathways Analysis, we identified the agonistic regulation of calcium homeostasis and bone remodeling between 1α,25(OH)2D3 and 25(OH)D3 and unique non-classical actions of each metabolite in physiological and pathological processes, including cell cycle, keratinocyte differentiation, amyotrophic lateral sclerosis signaling, gene transcription, immunomodulation, epigenetics, cell differentiation, and membrane protein expression. In conclusion, there are three distinct vitamin D3 hormones with clearly different biological activities. This study presents a new conceptual insight into the vitamin D3 endocrine system, which may guide the strategic use of vitamin D3 in disease prevention and treatment.
Inherited factors predisposing individuals to breast and ovarian cancer are largely unidentified in a majority of families with hereditary breast and ovarian cancer (HBOC). We aimed to identify germline copy number variations (CNVs) contributing to HBOC susceptibility in the Finnish population.
A cohort of 84 HBOC individuals (negative for BRCA1/2-founder mutations and pre-screened for the most common breast cancer genes) and 36 healthy controls were analysed with a genome-wide SNP array. CNV-affecting genes were further studied by Gene Ontology term enrichment, pathway analyses, and database searches to reveal genes with potential for breast and ovarian cancer predisposition. CNVs that were considered to be important were validated and genotyped in 20 additional HBOC individuals (6 CNVs) and in additional healthy controls (5 CNVs) by qPCR.
An intronic deletion in the EPHA3 receptor tyrosine kinase was enriched in HBOC individuals (12 of 101, 11.9%) compared with controls (27 of 432, 6.3%) (OR = 1.96; P = 0.055). EPHA3 was identified in several enriched molecular functions including receptor activity. Both a novel intronic deletion in the CSMD1 tumor suppressor gene and a homozygous intergenic deletion at 5q15 were identified in 1 of 101 (1.0%) HBOC individuals but were very rare (1 of 436, 0.2% and 1 of 899, 0.1%, respectively) in healthy controls, suggesting that these variants confer disease susceptibility.
This study reveals new information regarding the germline CNVs that likely contribute to HBOC susceptibility in Finland. This information may be used to facilitate the genetic counselling of HBOC individuals, but the preliminary results warrant additional studies of a larger study group.
Sharing of data about variation and the associated phenotypes is a critical need, yet variant information can be arbitrarily complex, making a single standard vocabulary elusive and re-formatting difficult. Complex standards have proven too time-consuming to implement.
The GEN2PHEN project addressed these difficulties by developing a comprehensive data model for capturing biomedical observations, Observ-OM, and building the VarioML format around it. VarioML pairs a simplified open specification for describing variants with a toolkit for adapting the specification into one's own research workflow. Straightforward variant data can be captured, federated, and exchanged with no overhead; more complex data can be described without loss of compatibility. The open specification enables push-button submission to gene variant databases (LSDBs), e.g., the Leiden Open Variation Database, using the Cafe Variome data publishing service, while VarioML bidirectionally transforms data between XML and web-application code formats, opening up new possibilities for open source web applications building on shared data. A Java implementation toolkit makes VarioML easily integrated into biomedical applications. VarioML is designed primarily for LSDB data submission and transfer scenarios, but can also be used as a standard variation data format for JSON and XML document databases and user interface components.
VarioML is a set of tools and practices improving the availability, quality, and comprehensibility of human variation information. It enables researchers, diagnostic laboratories, and clinics to share that information with ease, clarity, and without ambiguity.
LSDB; Variation database curation; Data collection; Distribution
Prediction methods are increasingly used in biosciences to forecast diverse features and characteristics. Binary two-state classifiers are the most common applications. They are usually based on machine learning approaches. For the end user it is often problematic to evaluate the true performance and applicability of computational tools, as this requires some knowledge of computer science and statistics.
Instructions are given on how to interpret and compare method evaluation results. Systematic analysis of method performance requires established benchmark datasets that contain cases with known outcomes, together with suitable evaluation measures. The criteria for benchmark datasets are discussed along with their implementation in VariBench, a benchmark database for variations. There is no single measure that alone could describe all the aspects of method performance. Predictions of genetic variation effects at the DNA, RNA and protein levels are important, as information about variants can be produced much faster than their disease relevance can be experimentally verified. Therefore, numerous prediction tools have been developed; however, systematic analyses of their performance and comparisons have only just started to emerge.
The end users of prediction tools should be able to understand how evaluation is done and how to interpret the results. Six main performance evaluation measures are introduced. These include sensitivity, specificity, positive predictive value, negative predictive value, accuracy and Matthews correlation coefficient. Together with receiver operating characteristics (ROC) analysis they provide a good picture about the performance of methods and allow their objective and quantitative comparison. A checklist of items to look at is provided. Comparisons of methods for missense variant tolerance, protein stability changes due to amino acid substitutions, and effects of variations on mRNA splicing are presented.
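The six measures named above can all be computed from the four cells of a binary confusion matrix. The following Python sketch shows the standard formulas; the class and function names are illustrative, and the example counts are made up rather than taken from any of the compared methods.

```python
import math
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ConfusionMatrix:
    tp: int  # true positives
    tn: int  # true negatives
    fp: int  # false positives
    fn: int  # false negatives


def _ratio(numerator: float, denominator: float) -> float:
    """Return numerator/denominator, or NaN when the measure is undefined."""
    return numerator / denominator if denominator else float("nan")


def evaluate(cm: ConfusionMatrix) -> Dict[str, float]:
    tp, tn, fp, fn = cm.tp, cm.tn, cm.fp, cm.fn
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "mcc": _ratio(tp * tn - fp * fn, mcc_denominator),
    }


# Hypothetical counts for a test set of 50 positive and 50 negative cases.
example = ConfusionMatrix(tp=40, tn=45, fp=5, fn=10)
results = evaluate(example)
for name, value in results.items():
    print(f"{name}: {value:.3f}")
```

For these counts, sensitivity is 0.8, specificity 0.9 and accuracy 0.85, while the MCC is about 0.70. Reporting the measures together matters because, on an unbalanced dataset, accuracy alone can remain high even when one class is predicted poorly.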
MuA transposase protein is a member of the retroviral integrase superfamily (RISF). It catalyzes DNA cleavage and joining reactions via an initial assembly and subsequent structural transitions of a protein-DNA complex, known as the Mu transpososome, ultimately attaching transposon DNA to non-specific target DNA. The transpososome functions as a molecular DNA-modifying machine and has been used in a wide variety of molecular biology and genetics/genomics applications. To analyze structure-function relationships in MuA action, a comprehensive pentapeptide insertion mutagenesis was carried out for the protein. A total of 233 unique insertion variants were generated, and their activity was analyzed using a quantitative in vivo DNA transposition assay. The results were then correlated with the known MuA structures, and the data were evaluated with regard to the protein domain function and transpososome development. To complement the analysis with an evolutionary component, a protein sequence alignment was produced for 44 members of MuA family transposases. Altogether, the results pinpointed those regions, in which insertions can be tolerated, and those where insertions are harmful. Most insertions within the subdomains Iγ, IIα, IIβ, and IIIα completely destroyed the transposase function, yet insertions into certain loop/linker regions of these subdomains increased the protein activity. Subdomains Iα and IIIβ were largely insertion-tolerant. The comprehensive structure-function data set will be useful for designing MuA transposase variants with improved properties for biotechnology/genomics applications, and is informative with regard to the function of RISF proteins in general.
The third Human Variome Project (HVP) Meeting “Integration and Implementation” was held under UNESCO Patronage in Paris, France, at the UNESCO Headquarters May 10–14, 2010. The major aims of the HVP are the collection, curation, and distribution of all human genetic variation affecting health. The HVP has drawn together disparate groups, by country, gene of interest, and expertise, who are working for the common good with the shared goal of pushing the boundaries of the human variome and collaborating to avoid unnecessary duplication. The meeting addressed the 12 key areas that form the current framework of HVP activities: Ethics; Nomenclature and Standards; Publication, Credit and Incentives; Data Collection from Clinics; Overall Data Integration and Access—Peripheral Systems/Software; Data Collection from Laboratories; Assessment of Pathogenicity; Country Specific Collection; Translation to Healthcare and Personalized Medicine; Data Transfer, Databasing, and Curation; Overall Data Integration and Access—Central Systems; and Funding Mechanisms and Sustainability. In addition, three societies that support the goals and the mission of HVP also held their own Workshops with the view to advance disease-specific variation data collection and utilization: the International Society for Gastrointestinal Hereditary Tumours, the Micronutrient Genomics Project, and the Neurogenetics Consortium.
mutation; variation; genomics; genetic disease
Several predisposition loci for hereditary prostate cancer (HPC) have been suggested, including HPCX1 at Xq27-q28, but due to the complex structure of the region, the susceptibility gene has not yet been identified.
In this study, nonsense-mediated mRNA decay (NMD) inhibition was used for the discovery of truncating mutations. Six prostate cancer (PC) patients and their healthy brothers were selected from a group of HPCX1-linked families. Expression analyses were done using Agilent 44 K oligoarrays, and selected genes were screened for mutations by direct sequencing. In addition, microRNA expression levels in the lymphoblastic cells were analyzed to trace variants that might alter miRNA expression and partly explain an inherited genetic predisposition to PC.
Seventeen genes were selected for resequencing based on the NMD array, but no truncating mutations were found. The most interesting variant was MAGEC1 p.Met1?. An association was seen between the variant and unselected PC (OR = 2.35, 95% CI = 1.10-5.02) and HPC (OR = 3.38, 95% CI = 1.10-10.40). miRNA analysis revealed altogether 29 miRNAs with altered expression between the PC cases and controls. miRNA target analysis revealed that 12 of them also had possible target sites in the MAGEC1 gene. These miRNAs were selected for validation process including four miRNAs located in the X chromosome. The expressions of 14 miRNAs were validated in families that contributed to the significant signal differences in Agilent arrays.
Further functional studies are needed to fully understand the possible contribution of these miRNAs and the MAGEC1 start codon variant to PC.
The EGFR-MEK-ERK signaling pathway has an established role in promoting malignant growth and disease progression in human cancers. Therefore, identification of transcriptional targets mediating the oncogenic effects of the EGFR-MEK-ERK pathway would be highly relevant. Cancerous inhibitor of protein phosphatase 2A (CIP2A) is a recently characterized human oncoprotein. CIP2A promotes malignant cell growth and is overexpressed at high frequency (40–80%) in most of the human cancer types. However, the mechanisms inducing its expression in cancer still remain largely unexplored. Here we present a systematic analysis of the contribution of potential gene regulatory mechanisms to high CIP2A expression in cancer. Our data show that evolutionarily conserved CpG islands at the proximal CIP2A promoter are unmethylated in both normal and cancer cells. Furthermore, sequencing of the active CIP2A promoter region from altogether seven normal and malignant cell types did not reveal any sequence alterations that would increase CIP2A expression specifically in cancer cells. However, treatment of cancer cells with various signaling pathway inhibitors revealed that CIP2A mRNA expression was sensitive to inhibition of EGFR activity as well as inhibition or activation of the MEK-ERK pathway. Moreover, MEK1/2-specific siRNAs decreased CIP2A protein expression. A series of CIP2A promoter-luciferase constructs was created to identify the proximal −27 to −107 promoter region responsible for MEK-dependent stimulation of CIP2A expression. Additional mutagenesis and chromatin immunoprecipitation experiments revealed ETS1 as the transcription factor mediating stimulation of CIP2A expression through the EGFR-MEK pathway. Thus, ETS1 is probably mediating high CIP2A expression in human cancers with increased EGFR-MEK1/2-ERK pathway activity.
These results also suggest that in addition to its established role in invasion and angiogenesis, ETS1 may support malignant cellular growth via regulation of CIP2A expression and protein phosphatase 2A inhibition.
Two major high-penetrance breast cancer genes, BRCA1 and BRCA2, are responsible for approximately 20% of hereditary breast cancer (HBC) cases in Finland. Additionally, rare mutations in several other genes that interact with BRCA1 and BRCA2 increase the risk of HBC. Still, a majority of HBC cases remain unexplained which is challenging for genetic counseling. We aimed to analyze additional mutations in HBC-associated genes and to define the sensitivity of our current BRCA1/2 mutation analysis protocol used in genetic counseling.
Eighty-two well-characterized, high-risk hereditary breast and/or ovarian cancer (HBOC) BRCA1/2-founder mutation-negative Finnish individuals were screened for germline alterations in seven breast cancer susceptibility genes, BRCA1, BRCA2, CHEK2, PALB2, BRIP1, RAD50, and CDH1. BRCA1/2 were analyzed by multiplex ligation-dependent probe amplification (MLPA) and direct sequencing. CHEK2 was analyzed by the high resolution melt (HRM) method and PALB2, RAD50, BRIP1 and CDH1 were analyzed by direct sequencing. Carrier frequencies between the 82 HBOC BRCA1/2-founder mutation-negative Finnish individuals and 384 healthy Finnish population controls were compared by using Fisher's exact test. In silico prediction of the effects of novel missense variants was carried out by using the Pathogenic-Or-Not-Pipeline (PON-P).
Three previously reported breast cancer-associated variants, BRCA1 c.5095C > T, CHEK2 c.470T > C, and CHEK2 c.1100delC, were observed in eleven (13.4%) individuals. Ten of these individuals (12.2%) had CHEK2 variants, c.470T > C and/or c.1100delC. Fourteen novel sequence alterations and nine individuals with more than one non-synonymous variant were identified. One of the novel variants, BRCA2 c.72A > T (Leu24Phe) was predicted to be likely pathogenic in silico. No large genomic rearrangements were detected in BRCA1/2 by multiplex ligation-dependent probe amplification (MLPA).
In this study, mutations in previously known breast cancer susceptibility genes can explain 13.4% of the analyzed high-risk BRCA1/2-negative HBOC individuals. CHEK2 mutations, c.470T > C and c.1100delC, make a considerable contribution (12.2%) to these high-risk individuals but further segregation analysis is needed to evaluate the clinical significance of these mutations before applying them in clinical use. Additionally, we identified novel variants that warrant additional studies. Our current genetic testing protocol for 28 Finnish BRCA1/2-founder mutations and protein truncation test (PTT) of the largest exons is sensitive enough for clinical use as a primary screening tool.
Subcellular localization is an important protein property, which is related to function, interactions and other features. As experimental determination of the localization can be tedious, especially for large numbers of proteins, a number of prediction tools have been developed. We developed the PROlocalizer service that integrates 11 individual methods to predict altogether 12 localizations for animal proteins. The method allows the submission of a number of proteins and mutations and generates a detailed informative document of the prediction and obtained results. PROlocalizer is available at http://bioinf.uta.fi/PROlocalizer/.
Protein localization prediction; Cell compartments; Mutations; Disease-causing mutations; Prediction method
prostatic; neoplasia; chromosome; aberration; clonal
Eukaryotic cells contain numerous compartments, which have different protein constituents. Proteins are typically directed to compartments by short peptide sequences that act as targeting signals. Translocation to the proper compartment allows a protein to form the necessary interactions with its partners and take part in biological networks such as signalling and metabolic pathways. If a protein is not transported to the correct intracellular compartment either the reaction performed or information carried by the protein does not reach the proper site, causing either inactivation of central reactions or misregulation of signalling cascades, or the mislocalized active protein has harmful effects by acting in the wrong place.
Numerous methods have been developed to predict protein subcellular localization with quite high accuracy. We applied bioinformatics methods to investigate the effects of known disease-related mutations on protein targeting and localization by analyzing over 22,000 missense mutations in more than 1,500 proteins with two complementary prediction approaches. Several hundred putative localization affecting mutations were identified and investigated statistically.
Although alterations to localization signals are rare, these effects should be taken into account when analyzing the consequences of disease-related mutations.
Functioning of the immune system requires the coordinated expression and action of many genes and proteins. With the emergence of high-throughput technologies, a great amount of molecular data is available for the genes and proteins of the immune system. However, these data are scattered into several databases and literature and therefore integration is needed.
The Immunome Knowledge Base (IKB) is a dedicated resource for immunological information. We identified and collected genes that are essential for the immunome. Nucleotide and protein sequences, as well as information about the related pseudogenes are available for 893 human essential immunome genes. To allow the study of the evolution of the immune system, data for the orthologs of human genes was collected. In addition to the human immunome, ortholog groups of 1811 metazoan immunity genes are available with information about the evidence of their immunity function. IKB combines three previous databases and several additional data items in an integrated system.
IKB provides access to several databases and resources in a single service and contains plenty of new data about the immune system. The most recent addition is variation data on genomic, transcriptomic and proteomic levels for all the immunome genes and proteins. In the future, more data will be added on the function of these genes. The service has a free and public web interface.
Disturbed cellular cholesterol homeostasis may lead to accumulation of cholesterol in human atheroma plaques. Cellular cholesterol homeostasis is controlled by the sterol regulatory element-binding transcription factor 2 (SREBF-2) and the SREBF cleavage-activating protein (SCAP). We investigated whole genome expression in a series of human atherosclerotic samples from different vascular territories and studied whether the non-synonymous coding variants in the interacting domains of two genes, SREBF-2 1784G>C (rs2228314) and SCAP 2386A>G, are related to the progression of coronary atherosclerosis and the risk of pre-hospital sudden cardiac death (SCD).
Whole genome expression profiling was completed in twenty vascular samples from carotid, aortic and femoral atherosclerotic plaques and six control samples from internal mammary arteries. Three hundred sudden pre-hospital deaths of middle-aged (33–69 years) Caucasian Finnish men were subjected to detailed autopsy in the Helsinki Sudden Death Study. Coronary narrowing and areas of coronary wall covered with fatty streaks or fibrotic, calcified or complicated lesions were measured and related to the SREBF-2 and SCAP genotypes.
Whole genome expression profiling showed a significant (p = 0.02) down-regulation of SREBF-2 in atherosclerotic carotid plaques (types IV-V), but not in the aorta or femoral arteries (p = NS for both), as compared with the histologically confirmed non-atherosclerotic tissues. In logistic regression analysis, a significant interaction between the SREBF-2 1784G>C and the SCAP 2386A>G genotype was observed on the risk of SCD (p = 0.046). Men with the SREBF-2 C allele and the SCAP G allele had a significantly increased risk of SCD (OR 2.68, 95% CI 1.07–6.71), compared to SCAP AA homozygous subjects carrying the SREBF-2 C allele. Furthermore, similar trends for having complicated lesions and for the occurrence of thrombosis were found, although the results were not statistically significant.
The results suggest that the allelic variants (SREBF-2 1784G>C and SCAP 2386A>G) in the cholesterol homeostasis regulating SREBF-SCAP pathway may contribute to SCD in early middle-aged men.
Disease gene identification is still a challenge despite modern high-throughput methods. Many diseases are very rare or lethal and thus cannot be investigated with traditional methods. Several in silico methods have been developed but they have some limitations. We introduce a new method that combines information about protein-interaction network properties and Gene Ontology terms. Genes with high-calculated network scores and statistically significant gene ontology terms based on known diseases are prioritized as candidate genes. The method was applied to identify novel primary immunodeficiency-related genes, 26 of which were found. The investigation uses the protein-interaction network for all essential immunome human genes available in the Immunome Knowledge Base and an analysis of their enriched gene ontology annotations. The identified disease gene candidates are mainly involved in cellular signaling including receptors, protein kinases and adaptor and binding proteins as well as enzymes. The method can be generalized for any disease group with sufficient information.
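The Gene Ontology part of such a prioritization depends on deciding whether a term is statistically over-represented among a set of genes. A common way to test this is a one-sided hypergeometric test, sketched below in Python; whether the method described here uses exactly this test is an assumption, and the function name and the toy numbers are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population: int, annotated: int, sample: int, hits: int) -> float:
    """P(X >= hits) for X ~ Hypergeometric(population, annotated, sample).

    population: number of genes in the background set
    annotated:  background genes carrying the GO term
    sample:     number of genes in the set of interest (e.g. known disease genes)
    hits:       genes in the set of interest carrying the GO term
    """
    if not (0 <= annotated <= population and 0 <= sample <= population):
        raise ValueError("annotated and sample must lie between 0 and population")
    if not 0 <= hits <= min(annotated, sample):
        raise ValueError("hits must lie between 0 and min(annotated, sample)")
    total = comb(population, sample)
    # Sum the probabilities of observing `hits` or more annotated genes.
    tail = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(hits, min(annotated, sample) + 1)
    )
    return tail / total


# Toy numbers: 10 background genes, 3 carry the term, 4 genes sampled, 2 of them annotated.
p_value = hypergeom_enrichment_p(population=10, annotated=3, sample=4, hits=2)
print(p_value)
```

In a real analysis many terms are tested at once, so the resulting p-values would normally be corrected for multiple testing before a term is called significant.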
Pseudogenes, nonfunctional copies of genes, evolve fast due to the lack of evolutionary pressures and thus appear in several different forms. PseudoGeneQuest is an online tool to search the human genome for a given query sequence and to identify different types of pseudogenes as well as novel genes and gene fragments.
The service can detect pseudogenes that have arisen either by retrotransposition or segmental genome duplication, many of which are not listed in the public pseudogene databases. The service has a user-friendly web interface and uses a powerful computer cluster in order to perform parallel searches and provide relatively fast runtimes despite exhaustive database searches and analyses.
PseudoGeneQuest is a versatile tool for detecting novel pseudogene candidates from the human genome. The service searches human genome sequences for five types of pseudogenes and provides an output that allows easy further analysis of observations. In addition to the result file the system provides visualization of the results linked to Ensembl Genome Browser. PseudoGeneQuest service is freely available.
Details of the mechanisms and selection pressures that shape the emergence and development of complex biological systems, such as the human immune system, are poorly understood. A recent definition of a reference set of proteins essential for the human immunome, combined with information about protein interaction networks for these proteins, facilitates evolutionary study of this biological machinery.
Here, we present a detailed study of the development of the immunome protein interaction network during eight evolutionary steps from Bilateria ancestors to human. New nodes show preferential attachment to high degree proteins. The efficiency of the immunome protein interaction network increases during the evolutionary steps, whereas the vulnerability of the network decreases.
Our results shed light on selective forces acting on the emergence of biological networks. It is likely that the high efficiency and low vulnerability are intrinsic properties of many biological networks, which arise from the effects of evolutionary processes yet to be uncovered.
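The two network measures discussed above can be made concrete with a small sketch. It assumes the commonly used definitions, which the abstract itself does not spell out: global efficiency as the mean of inverse shortest-path lengths over all ordered node pairs, and network vulnerability as the largest relative drop in efficiency caused by removing a single node. The function names and the edge-list input format are hypothetical.

```python
from collections import deque


def build_graph(edges):
    """Undirected, unweighted graph as a dict: node -> set of neighbours."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def bfs_distances(graph, source):
    """Shortest-path length from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def global_efficiency(graph):
    """Mean of 1/d(i, j) over ordered pairs; unreachable pairs contribute 0."""
    n = len(graph)
    if n < 2:
        return 0.0
    total = 0.0
    for source in graph:
        for target, d in bfs_distances(graph, source).items():
            if target != source:
                total += 1.0 / d
    return total / (n * (n - 1))


def remove_node(graph, node):
    """Copy of the graph without `node` and its edges."""
    return {g: nbrs - {node} for g, nbrs in graph.items() if g != node}


def vulnerability(graph):
    """Largest relative efficiency loss caused by deleting one node."""
    base = global_efficiency(graph)
    if base == 0:
        return 0.0
    return max(
        (base - global_efficiency(remove_node(graph, v))) / base
        for v in graph
    )
```

Under these definitions a triangle has efficiency 1 and vulnerability 0, whereas a three-node path loses all of its efficiency when the middle node is removed. Applying the functions to the network at each evolutionary step would give the kind of trend described above.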
Most genetic disorders are linked to missense mutations as even minor changes in the size or properties of an amino acid can alter or prevent the function of the protein. Further, the effect of a mutation is also dependent on the sequence and structure context of the alteration.
We investigated the spectrum of disease-causing missense mutations in secondary structure elements in proteins with numerous known mutations and for which an experimentally defined three-dimensional structure is available. We obtained a comprehensive map of the differences in mutation frequencies, location and contact energies, and the changes in residue volume and charge – both in the mutated (original) amino acids and in the mutant amino acids in the different secondary structure types. We collected information for 44 different proteins involved in a large number of diseases. The studied proteins contained a total of 2413 mutations of which 1935 (80%) appeared in secondary structures. Differences in mutation patterns between secondary structures and whole proteins were generally not statistically significant whereas within the secondary structural elements numerous highly significant features were observed.
Numerous trends in mutated and mutant amino acids are apparent. Among the original residues, arginine clearly has the highest relative mutability. The overall relative mutability among mutant residues is highest for cysteine and tryptophan. The mutability values are higher for mutated residues than for mutant residues. Arginine and glycine are among the most mutated residues in all secondary structures whereas the other amino acids have large variations in mutability between structure types. Statistical analysis was used to reveal trends in different secondary structural elements, residue types as well as for the charge and volume changes.
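As a rough illustration of how a relative mutability value can be computed, the sketch below divides the share of mutations that affect a residue type by the share of that residue type in the protein sequences. This normalization is an assumption; the study may have defined relative mutability differently. The function name and input format are hypothetical.

```python
from collections import Counter


def relative_mutability(sequences, mutated_residues):
    """Observed/expected mutation share for each residue type.

    sequences: iterable of protein sequences (one-letter codes).
    mutated_residues: one entry per mutation, giving the one-letter code of
        the original residue (or of the mutant residue, to compute
        mutability among mutant residues instead).
    A value above 1 means the residue is hit more often than its
    abundance alone would predict.
    """
    composition = Counter("".join(sequences))
    mutations = Counter(mutated_residues)
    total_residues = sum(composition.values())
    total_mutations = sum(mutations.values())
    if total_residues == 0 or total_mutations == 0:
        return {}
    result = {}
    for residue, count in composition.items():
        observed = mutations.get(residue, 0) / total_mutations
        expected = count / total_residues
        result[residue] = observed / expected
    return result
```

Restricting `sequences` and `mutated_residues` to positions within one secondary structure type would give per-structure values comparable across structure types.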
Understanding networks of protein–protein interactions is an essential step towards a comprehensive description of cell function. Whereas efficient techniques are readily available for the initial identification of interacting protein partners, practical strategies are lacking for the subsequent high-resolution mapping of regions involved in protein–protein interfaces. We present here a genetic strategy to accurately map interacting protein regions at amino acid precision. The system is based on parallel construction, sampling and analysis of a comprehensive insertion mutant library. The methodology integrates Mu in vitro transposition-based random pentapeptide mutagenesis of proteins, yeast two-hybrid screening and high-resolution genetic footprinting. The strategy is general and applicable to any interacting protein pair. We demonstrate the feasibility of the methodology by mapping the region in human JFC1 that interacts with Rab8A, and we show that the association is mediated by the Slp homology domain 1.