Sharing of data about variation and the associated phenotypes is a critical need, yet variant information can be arbitrarily complex, making a single standard vocabulary elusive and re-formatting difficult. Complex standards have proven too time-consuming to implement.
The GEN2PHEN project addressed these difficulties by developing a comprehensive data model for capturing biomedical observations, Observ-OM, and building the VarioML format around it. VarioML pairs a simplified open specification for describing variants with a toolkit for adapting the specification into one's own research workflow. Straightforward variant data can be captured, federated, and exchanged with no overhead; more complex data can be described without loss of compatibility. The open specification enables push-button submission to gene variant databases (LSDBs), e.g., the Leiden Open Variation Database, using the Cafe Variome data publishing service, while VarioML bidirectionally transforms data between XML and web-application code formats, opening up new possibilities for open source web applications building on shared data. A Java implementation toolkit makes VarioML easy to integrate into biomedical applications. VarioML is designed primarily for LSDB data submission and transfer scenarios, but can also be used as a standard variation data format for JSON and XML document databases and user interface components.
VarioML is a set of tools and practices improving the availability, quality, and comprehensibility of human variation information. It enables researchers, diagnostic laboratories, and clinics to share that information with ease and clarity, and without ambiguity.
LSDB; Variation database curation; Data collection; Distribution
Prediction methods are increasingly used in biosciences to forecast diverse features and characteristics. Binary two-state classifiers are the most common applications. They are usually based on machine learning approaches. For the end user it is often problematic to evaluate the true performance and applicability of computational tools as some knowledge about computer science and statistics would be needed.
Instructions are given on how to interpret and compare method evaluation results. Systematic analysis of method performance requires established benchmark datasets, which contain cases with known outcomes, and suitable evaluation measures. The criteria for benchmark datasets are discussed along with their implementation in VariBench, a benchmark database for variations. There is no single measure that alone could describe all the aspects of method performance. Predictions of genetic variation effects on DNA, RNA and protein level are important as information about variants can be produced much faster than their disease relevance can be experimentally verified. Therefore, numerous prediction tools have been developed; however, systematic analyses of their performance and comparison have just started to emerge.
The end users of prediction tools should be able to understand how evaluation is done and how to interpret the results. Six main performance evaluation measures are introduced. These include sensitivity, specificity, positive predictive value, negative predictive value, accuracy and Matthews correlation coefficient. Together with receiver operating characteristics (ROC) analysis they provide a good picture about the performance of methods and allow their objective and quantitative comparison. A checklist of items to look at is provided. Comparisons of methods for missense variant tolerance, protein stability changes due to amino acid substitutions, and effects of variations on mRNA splicing are presented.
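The six measures named above can all be derived from the four counts of a binary confusion matrix. The following is a minimal sketch using the standard textbook definitions; the function name `binary_measures` and the example counts are illustrative assumptions, not taken from the paper.

```python
import math


def binary_measures(tp, tn, fp, fn):
    """Compute six evaluation measures from confusion-matrix counts.

    tp/tn/fp/fn are the numbers of true positives, true negatives,
    false positives and false negatives. Undefined ratios (zero
    denominator) are returned as NaN rather than raising an error.
    """
    def ratio(num, den):
        return num / den if den else float("nan")

    # Denominator of the Matthews correlation coefficient.
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": ratio(tp, tp + fn),          # true positive rate
        "specificity": ratio(tn, tn + fp),          # true negative rate
        "ppv": ratio(tp, tp + fp),                  # positive predictive value
        "npv": ratio(tn, tn + fn),                  # negative predictive value
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),   # ranges from -1 to 1
    }


# Hypothetical counts for illustration only.
measures = binary_measures(tp=40, tn=45, fp=5, fn=10)
```

Reporting all six values together, rather than accuracy alone, matters most when the positive and negative classes are of unequal size.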
MuA transposase protein is a member of the retroviral integrase superfamily (RISF). It catalyzes DNA cleavage and joining reactions via an initial assembly and subsequent structural transitions of a protein-DNA complex, known as the Mu transpososome, ultimately attaching transposon DNA to non-specific target DNA. The transpososome functions as a molecular DNA-modifying machine and has been used in a wide variety of molecular biology and genetics/genomics applications. To analyze structure-function relationships in MuA action, a comprehensive pentapeptide insertion mutagenesis was carried out for the protein. A total of 233 unique insertion variants were generated, and their activity was analyzed using a quantitative in vivo DNA transposition assay. The results were then correlated with the known MuA structures, and the data were evaluated with regard to the protein domain function and transpososome development. To complement the analysis with an evolutionary component, a protein sequence alignment was produced for 44 members of MuA family transposases. Altogether, the results pinpointed those regions, in which insertions can be tolerated, and those where insertions are harmful. Most insertions within the subdomains Iγ, IIα, IIβ, and IIIα completely destroyed the transposase function, yet insertions into certain loop/linker regions of these subdomains increased the protein activity. Subdomains Iα and IIIβ were largely insertion-tolerant. The comprehensive structure-function data set will be useful for designing MuA transposase variants with improved properties for biotechnology/genomics applications, and is informative with regard to the function of RISF proteins in general.
The third Human Variome Project (HVP) Meeting “Integration and Implementation” was held under UNESCO Patronage in Paris, France, at the UNESCO Headquarters May 10–14, 2010. The major aims of the HVP are the collection, curation, and distribution of all human genetic variation affecting health. The HVP has drawn together disparate groups, by country, gene of interest, and expertise, who are working for the common good with the shared goal of pushing the boundaries of the human variome and collaborating to avoid unnecessary duplication. The meeting addressed the 12 key areas that form the current framework of HVP activities: Ethics; Nomenclature and Standards; Publication, Credit and Incentives; Data Collection from Clinics; Overall Data Integration and Access—Peripheral Systems/Software; Data Collection from Laboratories; Assessment of Pathogenicity; Country Specific Collection; Translation to Healthcare and Personalized Medicine; Data Transfer, Databasing, and Curation; Overall Data Integration and Access—Central Systems; and Funding Mechanisms and Sustainability. In addition, three societies that support the goals and the mission of HVP also held their own Workshops with the view to advance disease-specific variation data collection and utilization: the International Society for Gastrointestinal Hereditary Tumours, the Micronutrient Genomics Project, and the Neurogenetics Consortium.
mutation; variation; genomics; genetic disease
Several predisposition loci for hereditary prostate cancer (HPC) have been suggested, including HPCX1 at Xq27-q28, but due to the complex structure of the region, the susceptibility gene has not yet been identified.
In this study, nonsense-mediated mRNA decay (NMD) inhibition was used for the discovery of truncating mutations. Six prostate cancer (PC) patients and their healthy brothers were selected from a group of HPCX1-linked families. Expression analyses were done using Agilent 44 K oligoarrays, and selected genes were screened for mutations by direct sequencing. In addition, microRNA expression levels in the lymphoblastic cells were analyzed to trace variants that might alter miRNA expression and partly explain an inherited genetic predisposition to PC.
Seventeen genes were selected for resequencing based on the NMD array, but no truncating mutations were found. The most interesting variant was MAGEC1 p.Met1?. An association was seen between the variant and unselected PC (OR = 2.35, 95% CI = 1.10-5.02) and HPC (OR = 3.38, 95% CI = 1.10-10.40). miRNA analysis revealed altogether 29 miRNAs with altered expression between the PC cases and controls. miRNA target analysis revealed that 12 of them also had possible target sites in the MAGEC1 gene. These miRNAs were selected for validation, including four miRNAs located on the X chromosome. The expressions of 14 miRNAs were validated in families that contributed to the significant signal differences in Agilent arrays.
Further functional studies are needed to fully understand the possible contribution of these miRNAs and MAGEC1 start codon variant to PC.
The EGFR-MEK-ERK signaling pathway has an established role in promoting malignant growth and disease progression in human cancers. Therefore, identification of transcriptional targets mediating the oncogenic effects of the EGFR-MEK-ERK pathway would be highly relevant. Cancerous inhibitor of protein phosphatase 2A (CIP2A) is a recently characterized human oncoprotein. CIP2A promotes malignant cell growth and is overexpressed at high frequency (40–80%) in most human cancer types. However, the mechanisms inducing its expression in cancer still remain largely unexplored. Here we present a systematic analysis of the contribution of potential gene regulatory mechanisms to high CIP2A expression in cancer. Our data show that evolutionarily conserved CpG islands at the proximal CIP2A promoter are unmethylated in both normal and cancer cells. Furthermore, sequencing of the active CIP2A promoter region from altogether seven normal and malignant cell types did not reveal any sequence alterations that would increase CIP2A expression specifically in cancer cells. However, treatment of cancer cells with various signaling pathway inhibitors revealed that CIP2A mRNA expression was sensitive to inhibition of EGFR activity as well as inhibition or activation of the MEK-ERK pathway. Moreover, MEK1/2-specific siRNAs decreased CIP2A protein expression. A series of CIP2A promoter-luciferase constructs was created to identify the proximal −27 to −107 promoter region responsible for MEK-dependent stimulation of CIP2A expression. Additional mutagenesis and chromatin immunoprecipitation experiments revealed ETS1 as the transcription factor mediating stimulation of CIP2A expression through the EGFR-MEK pathway. Thus, ETS1 is probably mediating high CIP2A expression in human cancers with increased EGFR-MEK1/2-ERK pathway activity.
These results also suggest that in addition to its established role in invasion and angiogenesis, ETS1 may support malignant cellular growth via regulation of CIP2A expression and protein phosphatase 2A inhibition.
Two major high-penetrance breast cancer genes, BRCA1 and BRCA2, are responsible for approximately 20% of hereditary breast cancer (HBC) cases in Finland. Additionally, rare mutations in several other genes that interact with BRCA1 and BRCA2 increase the risk of HBC. Still, a majority of HBC cases remain unexplained, which is challenging for genetic counseling. We aimed to analyze additional mutations in HBC-associated genes and to define the sensitivity of our current BRCA1/2 mutation analysis protocol used in genetic counseling.
Eighty-two well-characterized, high-risk hereditary breast and/or ovarian cancer (HBOC) BRCA1/2-founder mutation-negative Finnish individuals were screened for germline alterations in seven breast cancer susceptibility genes, BRCA1, BRCA2, CHEK2, PALB2, BRIP1, RAD50, and CDH1. BRCA1/2 were analyzed by multiplex ligation-dependent probe amplification (MLPA) and direct sequencing. CHEK2 was analyzed by the high resolution melt (HRM) method and PALB2, RAD50, BRIP1 and CDH1 were analyzed by direct sequencing. Carrier frequencies between the 82 HBOC BRCA1/2-founder mutation-negative Finnish individuals and 384 healthy Finnish population controls were compared by using Fisher's exact test. In silico prediction of the effects of novel missense variants was carried out by using the Pathogenic-Or-Not-Pipeline (PON-P).
Three previously reported breast cancer-associated variants, BRCA1 c.5095C > T, CHEK2 c.470T > C, and CHEK2 c.1100delC, were observed in eleven (13.4%) individuals. Ten of these individuals (12.2%) had CHEK2 variants, c.470T > C and/or c.1100delC. Fourteen novel sequence alterations and nine individuals with more than one non-synonymous variant were identified. One of the novel variants, BRCA2 c.72A > T (Leu24Phe) was predicted to be likely pathogenic in silico. No large genomic rearrangements were detected in BRCA1/2 by multiplex ligation-dependent probe amplification (MLPA).
In this study, mutations in previously known breast cancer susceptibility genes can explain 13.4% of the analyzed high-risk BRCA1/2-negative HBOC individuals. CHEK2 mutations, c.470T > C and c.1100delC, make a considerable contribution (12.2%) to these high-risk individuals but further segregation analysis is needed to evaluate the clinical significance of these mutations before applying them in clinical use. Additionally, we identified novel variants that warrant additional studies. Our current genetic testing protocol for 28 Finnish BRCA1/2-founder mutations and protein truncation test (PTT) of the largest exons is sensitive enough for clinical use as a primary screening tool.
Subcellular localization is an important protein property, which is related to function, interactions and other features. As experimental determination of the localization can be tedious, especially for large numbers of proteins, a number of prediction tools have been developed. We developed the PROlocalizer service that integrates 11 individual methods to predict altogether 12 localizations for animal proteins. The method allows the submission of a number of proteins and mutations and generates a detailed informative document of the prediction and obtained results. PROlocalizer is available at http://bioinf.uta.fi/PROlocalizer/.
Protein localization prediction; Cell compartments; Mutations; Disease-causing mutations; Prediction method
prostatic; neoplasia; chromosome; aberration; clonal
Eukaryotic cells contain numerous compartments, which have different protein constituents. Proteins are typically directed to compartments by short peptide sequences that act as targeting signals. Translocation to the proper compartment allows a protein to form the necessary interactions with its partners and take part in biological networks such as signalling and metabolic pathways. If a protein is not transported to the correct intracellular compartment either the reaction performed or information carried by the protein does not reach the proper site, causing either inactivation of central reactions or misregulation of signalling cascades, or the mislocalized active protein has harmful effects by acting in the wrong place.
Numerous methods have been developed to predict protein subcellular localization with quite high accuracy. We applied bioinformatics methods to investigate the effects of known disease-related mutations on protein targeting and localization by analyzing over 22,000 missense mutations in more than 1,500 proteins with two complementary prediction approaches. Several hundred putative localization affecting mutations were identified and investigated statistically.
Although alterations to localization signals are rare, these effects should be taken into account when analyzing the consequences of disease-related mutations.
Functioning of the immune system requires the coordinated expression and action of many genes and proteins. With the emergence of high-throughput technologies, a great amount of molecular data is available for the genes and proteins of the immune system. However, these data are scattered into several databases and literature and therefore integration is needed.
The Immunome Knowledge Base (IKB) is a dedicated resource for immunological information. We identified and collected genes that are essential for the immunome. Nucleotide and protein sequences, as well as information about the related pseudogenes are available for 893 human essential immunome genes. To allow the study of the evolution of the immune system, data for the orthologs of human genes was collected. In addition to the human immunome, ortholog groups of 1811 metazoan immunity genes are available with information about the evidence of their immunity function. IKB combines three previous databases and several additional data items in an integrated system.
IKB provides access to several databases and resources in a single service and contains plenty of new data about the immune system. The most recent addition is variation data on genomic, transcriptomic and proteomic levels for all the immunome genes and proteins. In the future, more data will be added on the function of these genes. The service has a free and public web interface.
Disturbed cellular cholesterol homeostasis may lead to accumulation of cholesterol in human atheroma plaques. Cellular cholesterol homeostasis is controlled by the sterol regulatory element-binding transcription factor 2 (SREBF-2) and the SREBF cleavage-activating protein (SCAP). We investigated whole genome expression in a series of human atherosclerotic samples from different vascular territories and studied whether the non-synonymous coding variants in the interacting domains of two genes, SREBF-2 1784G>C (rs2228314) and SCAP 2386A>G, are related to the progression of coronary atherosclerosis and the risk of pre-hospital sudden cardiac death (SCD).
Whole genome expression profiling was completed in twenty vascular samples from carotid, aortic and femoral atherosclerotic plaques and six control samples from internal mammary arteries. Three hundred sudden pre-hospital deaths of middle-aged (33–69 years) Caucasian Finnish men were subjected to detailed autopsy in the Helsinki Sudden Death Study. Coronary narrowing and areas of coronary wall covered with fatty streaks or fibrotic, calcified or complicated lesions were measured and related to the SREBF-2 and SCAP genotypes.
Whole genome expression profiling showed a significant (p = 0.02) down-regulation of SREBF-2 in atherosclerotic carotid plaques (types IV-V), but not in the aorta or femoral arteries (p = NS for both), as compared with the histologically confirmed non-atherosclerotic tissues. In logistic regression analysis, a significant interaction between the SREBF-2 1784G>C and the SCAP 2386A>G genotype was observed on the risk of SCD (p = 0.046). Men with the SREBF-2 C allele and the SCAP G allele had a significantly increased risk of SCD (OR 2.68, 95% CI 1.07–6.71), compared to SCAP AA homozygous subjects carrying the SREBF-2 C allele. Furthermore, similar trends for having complicated lesions and for the occurrence of thrombosis were found, although the results were not statistically significant.
The results suggest that the allelic variants (SREBF-2 1784G>C and SCAP 2386A>G) in the cholesterol homeostasis regulating SREBF-SCAP pathway may contribute to SCD in early middle-aged men.
Disease gene identification is still a challenge despite modern high-throughput methods. Many diseases are very rare or lethal and thus cannot be investigated with traditional methods. Several in silico methods have been developed but they have some limitations. We introduce a new method that combines information about protein-interaction network properties and Gene Ontology terms. Genes with high-calculated network scores and statistically significant gene ontology terms based on known diseases are prioritized as candidate genes. The method was applied to identify novel primary immunodeficiency-related genes, 26 of which were found. The investigation uses the protein-interaction network for all essential immunome human genes available in the Immunome Knowledge Base and an analysis of their enriched gene ontology annotations. The identified disease gene candidates are mainly involved in cellular signaling including receptors, protein kinases and adaptor and binding proteins as well as enzymes. The method can be generalized for any disease group with sufficient information.
Pseudogenes, nonfunctional copies of genes, evolve fast due to the lack of evolutionary pressures and thus appear in several different forms. PseudoGeneQuest is an online tool to search the human genome for a given query sequence and to identify different types of pseudogenes as well as novel genes and gene fragments.
The service can detect pseudogenes that have arisen either by retrotransposition or segmental genome duplication, many of which are not listed in the public pseudogene databases. The service has a user-friendly web interface and uses a powerful computer cluster in order to perform parallel searches and provide relatively fast runtimes despite exhaustive database searches and analyses.
PseudoGeneQuest is a versatile tool for detecting novel pseudogene candidates from the human genome. The service searches human genome sequences for five types of pseudogenes and provides an output that allows easy further analysis of observations. In addition to the result file the system provides visualization of the results linked to Ensembl Genome Browser. PseudoGeneQuest service is freely available.
Details of the mechanisms and selection pressures that shape the emergence and development of complex biological systems, such as the human immune system, are poorly understood. A recent definition of a reference set of proteins essential for the human immunome, combined with information about protein interaction networks for these proteins, facilitates evolutionary study of this biological machinery.
Here, we present a detailed study of the development of the immunome protein interaction network during eight evolutionary steps from Bilateria ancestors to human. New nodes show preferential attachment to high degree proteins. The efficiency of the immunome protein interaction network increases during the evolutionary steps, whereas the vulnerability of the network decreases.
Our results shed light on selective forces acting on the emergence of biological networks. It is likely that the high efficiency and low vulnerability are intrinsic properties of many biological networks, which arise from the effects of evolutionary processes yet to be uncovered.
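Network efficiency and vulnerability can be defined in several ways, and the abstract does not say which definitions were used. The sketch below assumes one common choice: global efficiency as the mean inverse shortest-path length over all ordered node pairs, and vulnerability as the largest relative drop in efficiency caused by removing a single node. The function names and the toy star network are illustrative assumptions.

```python
from collections import deque


def global_efficiency(adj):
    """Mean of 1/d(i, j) over all ordered pairs of distinct nodes.

    adj maps each node to the set of its neighbours (undirected,
    unweighted). Unreachable pairs contribute 0 to the sum.
    """
    n = len(adj)
    if n < 2:
        return 0.0
    total = 0.0
    for source in adj:
        # Breadth-first search gives shortest path lengths from source.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(1.0 / d for node, d in dist.items() if node != source)
    return total / (n * (n - 1))


def without_node(adj, removed):
    """Copy of the network with one node and all of its edges deleted."""
    return {u: {v for v in nbrs if v != removed}
            for u, nbrs in adj.items() if u != removed}


def vulnerability(adj):
    """Largest relative efficiency loss from deleting any single node."""
    base = global_efficiency(adj)
    if base == 0:
        return 0.0
    return max((base - global_efficiency(without_node(adj, node))) / base
               for node in adj)


# Toy star network (assumed for illustration): one hub, three leaves.
star = {"hub": {"a", "b", "c"}, "a": {"hub"}, "b": {"hub"}, "c": {"hub"}}
```

In the star example, deleting the hub disconnects every remaining pair, so the network is maximally vulnerable under this definition even though its efficiency is fairly high.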
Most genetic disorders are linked to missense mutations as even minor changes in the size or properties of an amino acid can alter or prevent the function of the protein. Further, the effect of a mutation is also dependent on the sequence and structure context of the alteration.
We investigated the spectrum of disease-causing missense mutations in secondary structure elements in proteins with numerous known mutations and for which an experimentally defined three-dimensional structure is available. We obtained a comprehensive map of the differences in mutation frequencies, location and contact energies, and the changes in residue volume and charge – both in the mutated (original) amino acids and in the mutant amino acids in the different secondary structure types. We collected information for 44 different proteins involved in a large number of diseases. The studied proteins contained a total of 2413 mutations of which 1935 (80%) appeared in secondary structures. Differences in mutation patterns between secondary structures and whole proteins were generally not statistically significant whereas within the secondary structural elements numerous highly significant features were observed.
Numerous trends in mutated and mutant amino acids are apparent. Among the original residues, arginine clearly has the highest relative mutability. The overall relative mutability among mutant residues is highest for cysteine and tryptophan. The mutability values are higher for mutated residues than for mutant residues. Arginine and glycine are among the most mutated residues in all secondary structures whereas the other amino acids have large variations in mutability between structure types. Statistical analysis was used to reveal trends in different secondary structural elements, residue types as well as for the charge and volume changes.
Understanding networks of protein–protein interactions constitutes an essential component on a path towards comprehensive description of cell function. Whereas efficient techniques are readily available for the initial identification of interacting protein partners, practical strategies are lacking for the subsequent high-resolution mapping of regions involved in protein–protein interfaces. We present here a genetic strategy to accurately map interacting protein regions at amino acid precision. The system is based on parallel construction, sampling and analysis of a comprehensive insertion mutant library. The methodology integrates Mu in vitro transposition-based random pentapeptide mutagenesis of proteins, yeast two-hybrid screening and high-resolution genetic footprinting. The strategy is general and applicable to any interacting protein pair. We demonstrate the feasibility of the methodology by mapping the region in human JFC1 that interacts with Rab8A, and we show that the association is mediated by the Slp homology domain 1.
The ImmunoDeficiency Resource (IDR) is a knowledge base for the integration of the clinical, biochemical, genetic, genomic, proteomic, structural, and computational data of primary immunodeficiencies. The need for the IDR arises from the lack of structured and systematic information about primary immunodeficiencies on the Internet, and from the lack of a common platform which enables doctors, researchers, students, nurses and patients to find validated information about these diseases.
The IDR knowledge base, first released in 1999, has grown substantially. It contains information for 158 diseases, from both a clinical and a molecular point of view. The database and the user interface have been reformatted. This new IDR release has a richer and more complete breadth, depth and scope. The service provides the most complete and up-to-date dataset. The IDR has been integrated with several internal and external databases and services. The contents of the IDR are validated and selected for different types of users (doctors, nurses, researchers and students, as well as patients and their families). The search engine has been improved and allows either a detailed or a broad search from a simple user interface.
The IDR is the first knowledge base specifically designed to capture in a systematic and validated way both clinical and molecular information for primary immunodeficiencies. The service is freely available at http://bioinf.uta.fi/idr and is regularly updated. The IDR facilitates primary immunodeficiencies informatics and helps to parameterise in silico modelling of these diseases. The IDR is also useful as an advanced education tool for medical students and physicians.
The immune system, which is a complex machinery, is based on the highly coordinated expression of a wide array of genes and proteins. The evolutionary history of the human immune system is not well characterised. Although several studies related to the development and evolution of immunological processes have been published, a full-scale genome-based analysis is still missing. A database focused on the evolutionary relationships of immune related genes would contribute to and facilitate research on immunology and evolutionary biology.
An Internet resource called ImmTree was constructed for studying the evolution and evolutionary trees of the human immune system. ImmTree contains information about orthologs in 80 species collected from the HomoloGene, OrthoMCL and EGO databases. In addition to phylogenetic trees, the service provides data for the comparison of human-mouse ortholog pairs, including synonymous and non-synonymous mutation rates, Z values, and Ka/Ks quotients. A versatile search engine allows complex queries from the database. Currently, data is available for 847 human immune system related genes and proteins.
ImmTree provides a unique data set of genes and proteins from the human immune system, their phylogenetics, and information for comparisons of human-mouse ortholog pairs, synonymous and non-synonymous mutation rates, as well as other statistical information.
Multiple sequence alignment is the foundation of many important applications in bioinformatics that aim at detecting functionally important regions, predicting protein structures, building phylogenetic trees etc. Although the automatic construction of a multiple sequence alignment for a set of remotely related sequences is a very challenging and error-prone task, many downstream analyses still rely heavily on the accuracy of the alignments.
To address the need for an objective evaluation framework, we introduce a statistical score that assesses the quality of a given multiple sequence alignment. The quality assessment is based on counting the number of significantly conserved positions in the alignment using an importance sampling method in conjunction with a statistical profile analysis framework. We first evaluate a novel objective function used in the alignment quality score for measuring the positional conservation. The results for the Src homology 2 (SH2) domain, Ras-like proteins, peptidase M13, subtilase and β-lactamase families demonstrate that the score can distinguish sequence patterns with different degrees of conservation. Secondly, we evaluate the quality of the alignments produced by several widely used multiple sequence alignment programs using the novel alignment quality score and a commonly used sum of pairs method. According to these results, the Mafft strategy L-INS-i outperforms the other methods, although the difference between Probcons, TCoffee and Muscle is mostly insignificant. The novel alignment quality score provides results similar to those of the sum of pairs method.
The results indicate that the proposed statistical score is useful in assessing the quality of multiple sequence alignments.
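For comparison, the sum of pairs method mentioned above is often computed against a trusted reference alignment, as the fraction of residue pairs aligned in the reference that are also aligned in the test alignment. The sketch below assumes that reading; the abstract does not specify the exact variant used, and the function names and toy sequences are illustrative.

```python
def aligned_pairs(alignment, gap="-"):
    """Return the set of residue pairs that share a column.

    alignment is a list of equal-length gapped strings. Each residue is
    identified by (sequence index, ungapped position), so the result is
    independent of where the gaps happen to fall.
    """
    positions = [0] * len(alignment)
    pairs = set()
    for col in range(len(alignment[0])):
        present = []
        for seq_index, row in enumerate(alignment):
            if row[col] != gap:
                present.append((seq_index, positions[seq_index]))
                positions[seq_index] += 1
        # Every pair of residues in this column counts as aligned.
        for i in range(len(present)):
            for j in range(i + 1, len(present)):
                pairs.add((present[i], present[j]))
    return pairs


def sum_of_pairs_score(test, reference):
    """Fraction of reference residue pairs reproduced by the test alignment.

    Both alignments must contain the same sequences in the same order.
    """
    ref_pairs = aligned_pairs(reference)
    if not ref_pairs:
        return 0.0
    return len(ref_pairs & aligned_pairs(test)) / len(ref_pairs)


# Toy alignments (assumed for illustration).
reference = ["AC-GT", "ACAGT"]
test = ["ACG-T", "ACAGT"]
score = sum_of_pairs_score(test, reference)
```

Unlike the statistical score described in the abstract, this measure depends on having a reference alignment to compare against.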
Cells react to changing intra- and extracellular signals by dynamically modulating complex biochemical networks. Cellular responses to extracellular signals lead to changes in gene and protein expression. Since the majority of genes encode proteins, we investigated possible correlations between protein parameters and gene expression patterns to identify proteome-wide characteristics indicative of trends common to expressed proteins.
Numerous bioinformatics methods were used to filter and merge information regarding gene and protein annotations. A new statistical time point-oriented analysis was developed for the study of dynamic correlations in large time series data. The method was applied to investigate microarray datasets for different cell types, organisms and processes, including human B and T cell stimulation, Drosophila melanogaster life span, and Saccharomyces cerevisiae cell cycle.
We show that the properties of proteins synthesized correlate dynamically with the gene expression profile, indicating that not only is the actual identity and function of expressed proteins important for cellular responses but that several physicochemical and other protein properties correlate with gene expression as well. Gene expression correlates strongly with amino acid composition, composition- and sequence-derived variables, functional, structural, localization and gene ontology parameters. Thus, our results suggest that a dynamic relationship exists between proteome properties and gene expression in many biological systems, and therefore this relationship is fundamental to understanding cellular mechanisms in health and disease.
Functional genomics methods are used to investigate the huge amount of information contained in genomes. Numerous experimental methods rely on the use of oligo- or polynucleotides. Nucleotide strand hybridization forms the underlying principle for these methods. For all these techniques, the probes should be unique for analyzed genes. In addition to being unique for the studied genes, the probes should fulfill a large number of criteria to be usable and valid. The criteria include for example, avoidance of self-annealing, suitable melting temperature and nucleotide composition. We developed a method for searching unique and valid oligonucleotides or probes for genes so that there is not even a similar (approximate) occurrence in any other location of the whole genome. By using probe size 25, we analyzed 17 complete genomes representing a wide range of both prokaryotic and eukaryotic organisms. More than 92% of all the genes in the investigated genomes contained valid oligonucleotides. Extensive statistical tests were performed to characterize the properties of unique and valid oligonucleotides. Unique and valid oligonucleotides were relatively evenly distributed in genes except for the beginning and end, which were somewhat overrepresented. The flanking regions in eukaryotes were clearly underrepresented among suitable oligonucleotides. In addition to distributions within genes, the effects on codon and amino acid usage were also studied.
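A heavily simplified illustration of the probe search is given below. It assumes exact-match uniqueness on a single strand and a GC-content window as the only validity criterion, whereas the method described above also excludes approximate occurrences and applies further criteria such as melting temperature and self-annealing. The function names, thresholds and toy sequence are hypothetical.

```python
from collections import Counter


def gc_fraction(seq):
    """Fraction of G and C nucleotides in an uppercase DNA string."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def unique_valid_probes(genome, gene_start, gene_end, k=25,
                        gc_min=0.4, gc_max=0.6):
    """List (position, probe) for k-mers in a gene that occur once genome-wide.

    gene_start and gene_end are 0-based, end-exclusive coordinates.
    Only exact occurrences on the given strand are counted, which is a
    simplification of the approximate-match criterion in the text.
    """
    # Count every k-mer in the genome once, then look candidates up.
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    probes = []
    for i in range(gene_start, gene_end - k + 1):
        probe = genome[i:i + k]
        if counts[probe] == 1 and gc_min <= gc_fraction(probe) <= gc_max:
            probes.append((i, probe))
    return probes


# Toy example with a short probe length so the result is easy to inspect.
toy_genome = "ACGTACGTTTGCA"
toy_probes = unique_valid_probes(toy_genome, 0, len(toy_genome),
                                 k=4, gc_min=0.5, gc_max=0.5)
```

Holding all k-mers in memory is feasible only for small sequences; a genome-scale search would need an index structure, which is outside the scope of this sketch.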
Although biomedical information is growing rapidly, it is difficult to find and retrieve validated data, especially for rare hereditary diseases. There is an increased need for services capable of integrating and validating information as well as providing it in a logically organized structure. An XML-based language enables the creation of open source databases for storage, maintenance and delivery on different platforms.
Here we present a new data model, called the fact file, and an XML-based specification, the Inherited Disease Markup Language (IDML), which were developed to facilitate the integration, storage and exchange of disease information. The data model was applied to primary immunodeficiencies, but it can be used for any hereditary disease. Fact files integrate biomedical, genetic and clinical information related to hereditary diseases.
IDML and fact files were used to build a comprehensive Web- and WAP-accessible knowledge base, the ImmunoDeficiency Resource (IDR), available at . A fact file is a user-oriented interface that serves as a starting point for exploring information on hereditary diseases.
IDML enables the seamless integration and presentation of genetic and disease information resources on the Internet. It can be used to build information services for all kinds of inherited diseases. The open source specification and related programs are available at .
Jak tyrosine kinases have a unique domain structure containing a kinase domain (JH1) adjacent to a catalytically inactive pseudokinase domain (JH2). JH2 is crucial for inhibition of basal Jak activity, but the mechanism of this regulation has remained elusive. We show that JH2 negatively regulated Jak2 in bacterial cells, indicating that regulation is an intrinsic property of Jak2. JH2 suppressed basal Jak2 activity by lowering the Vmax of Jak2, whereas it did not affect the Km of Jak2 for a peptide substrate. Three inhibitory regions (IR1–3) within JH2 were identified. IR3 (residues 758–807), at the C terminus of JH2, directly inhibited JH1, suggesting an inhibitory interaction between IR3 and JH1. Molecular modeling of JH2 showed that IR3 could form a stable α-helical fold, supporting the idea that IR3 could independently inhibit JH1. IR2 (residues 725–757), in the C-terminal lobe of JH2, and IR1 (residues 619–670), extending from the N-terminal to the C-terminal lobe, enhanced IR3-mediated inhibition of JH1. Disruption of IR3 by either mutations or a small deletion increased basal Jak2 activity but abolished interferon-γ–inducible signaling. Together, the results provide evidence for autoinhibition of a Jak family kinase and identify JH2 regions important for autoregulation of Jak2.
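The kinetic statement above (lower Vmax, unchanged Km) can be made concrete with the standard Michaelis-Menten rate law, v = Vmax·[S] / (Km + [S]). The sketch below uses purely illustrative numbers in arbitrary units, not values from the study, to show that reducing Vmax at constant Km scales the rate by the same factor at every substrate concentration.

```python
def michaelis_menten(s: float, vmax: float, km: float) -> float:
    """Reaction rate at substrate concentration s under Michaelis-Menten kinetics."""
    return vmax * s / (km + s)


# Hypothetical parameters (arbitrary units), chosen only for illustration.
KM = 2.0
VMAX_WITHOUT_JH2 = 10.0
VMAX_WITH_JH2 = 4.0

for s in (0.5, 2.0, 8.0, 32.0):
    v_free = michaelis_menten(s, VMAX_WITHOUT_JH2, KM)
    v_inhibited = michaelis_menten(s, VMAX_WITH_JH2, KM)
    # The ratio equals VMAX_WITH_JH2 / VMAX_WITHOUT_JH2 regardless of s.
    print(f"[S]={s:5.1f}  v_free={v_free:6.3f}  "
          f"v_inhibited={v_inhibited:6.3f}  ratio={v_inhibited / v_free:.2f}")
```

Because Km is unchanged, the substrate concentration giving half-maximal rate stays the same in both cases; only the plateau differs.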
The ImmunoDeficiency Resource (IDR), freely available at http://www.uta.fi/imt/bioinfo/idr/, is a comprehensive knowledge base on immunodeficiencies. It is designed for different user groups, such as researchers, physicians and nurses, as well as patients, their families and the general public. Information on immunodeficiencies is stored as fact files, which are disease- and gene-based information resources. We have developed the inherited disease markup language (IDML) data model, which is designed for storing disease- and gene-specific data in extensible markup language (XML) format. Fact files written in IDML can be used to present data in different contexts and on different platforms. All the information in the IDR is validated by expert curators.
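The idea of storing a fact file once in XML and presenting it in several contexts can be sketched as follows. The element names (`factfile`, `disease`, `gene`, `name`, `symbol`) and the sample content are hypothetical placeholders, not the actual IDML schema, which is not reproduced in this text.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical fact file; the tags and values are placeholders, not real IDML.
FACT_FILE = """\
<factfile>
  <disease>
    <name>Example immunodeficiency</name>
  </disease>
  <gene>
    <symbol>GENE1</symbol>
  </gene>
</factfile>
"""


def parse_fact_file(xml_text: str) -> dict:
    """Extract the disease name and gene symbol from a fact file document."""
    root = ET.fromstring(xml_text)
    return {
        "disease": root.findtext("disease/name"),
        "gene": root.findtext("gene/symbol"),
    }


def as_plain_text(fact: dict) -> str:
    """Compact rendering, e.g. for a small-screen client."""
    return f"{fact['disease']} (gene: {fact['gene']})"


def as_html(fact: dict) -> str:
    """Web rendering of the same record, with values HTML-escaped."""
    return f"<h1>{escape(fact['disease'])}</h1><p>Gene: {escape(fact['gene'])}</p>"


fact = parse_fact_file(FACT_FILE)
print(as_plain_text(fact))
print(as_html(fact))
```

Keeping the record in one XML source and deriving each presentation from the parsed data is what allows the same fact file to serve different platforms.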