The BTB domain is a conserved protein-protein interaction motif. In this study, we identified 56 BTB domain-containing protein genes in the silkworm, in addition to 46 in the honey bee, 55 in the red flour beetle, and 53 in the monarch butterfly. Silkworm BTB protein genes were classified into nine subfamilies according to their domain architecture, and most of them could be mapped to different chromosomes. Phylogenetic analysis suggests that silkworm BTB protein genes may have undergone a duplication event in three subfamilies: BTB-BACK-Kelch, BTB-BACK-PHR, and BTB-FLYWCH. Comparative analysis demonstrated that 13 BTB protein genes show a strict orthologous relationship between the silkworm and the other surveyed insects, indicating conserved functions of these genes during insect evolution. Furthermore, several silkworm BTB protein genes exhibited sex-specific expression in larval tissues or at different stages during metamorphosis. These findings not only contribute to a better understanding of the evolution of insect BTB protein gene families but also provide a basis for further investigation of the functions of BTB protein genes in the silkworm.
Bismuth selenide (Bi2Se3), a new topological insulator, has attracted much attention in recent years owing to its relatively simple band structure and large bulk band gap. Compared to the bulk material, few-layer Bi2Se3 has recently been considered a highly promising material. Here, we use a liquid-phase exfoliation method to prepare few-layer Bi2Se3 in N-methyl-2-pyrrolidone or chitosan acetic solution. The resulting few-layer Bi2Se3 dispersion shows absorption in the visible light region, unlike bulk Bi2Se3, which has no absorption in this region. The absorption spectrum of few-layer Bi2Se3 depends on its size and layer number. In addition, the nonlinear and saturable absorption of few-layer Bi2Se3 thin film in the near infrared is characterized and further exploited to generate laser pulses by a passive Q-switching technique. Stable Q-switched operation is achieved with a low pump threshold of 9.3 mW at 974 nm, a pulse energy of 39.8 nJ and a wide pulse-repetition-rate range from 6.2 to 40.1 kHz. Therefore, few-layer Bi2Se3 may find potential applications in laser photonics and optoelectronic devices.
AIM: To investigate the clinicopathological features of gastric carcinoma in southern China and changes in disease trends over the last 18 years.
METHODS: We conducted a retrospective study in the Department of Gastrointestinal Surgery, the First Affiliated Hospital, Sun Yat-sen University. A total of 2100 adult patients with definitively diagnosed, histologically proven gastric carcinomas treated with radical gastrectomy from 1994 to 2013 were examined. For all cases, patient age, gender, tumor location, Borrmann type, histopathological type and grade, and pTNM stage were identified and recorded from hospital records. The data were analyzed with Stata 12.0 software.
RESULTS: The mean age of patients was 57 years, with a range of 19 to 89 years. A higher incidence was found in patients over 60 years of age. In the study population, 67.38% of patients were male and 32.62% were female. Among patients less than 40 years of age, women had a higher disease incidence than men (P < 0.001). No obvious change in patient age or gender was observed over the last 18 years. By location, tumors were most common in the antrum (44.57%), followed by the fundus/body (24.95%) and the cardia/gastroesophageal junction (23.00%). The mean tumor diameter was 5.57 cm, and advanced gross type Borrmann III was most common. Most patients were at advanced stages when first diagnosed, and patients with early stage disease were relatively rare. More early stage patients were detected in recent years, especially after 2000 (P < 0.001). Gastric carcinoma had different features in young and old patients. Young patients were more frequently female and more often had tumors in the antrum, larger tumor size, poorly differentiated carcinoma, a high rate of metastasis to other sites and advanced stages (P < 0.05).
CONCLUSION: In southern China, gastric carcinoma was more frequent in old men and young women. Because their disease features differ, young and old patients should be treated differently.
Gastric carcinoma; Retrospective study; Clinicopathological features; Southern China; Youth
Juvenile hormone (JH) coordinates with 20-hydroxyecdysone (20E) to regulate larval growth and molting in insects. However, little is known about how this cooperative control is achieved during larval stages. Here, we induced silkworm superlarvae by applying the JH analogue (JHA) methoprene and used a microarray approach to survey the mRNA expression changes in response to JHA in the silkworm integument. We found that JHA application significantly increased the expression levels of most genes involved in basic metabolic processes and protein processing and decreased the expression of genes associated with oxidative phosphorylation in the integument. Several key genes involved in the pathways of insulin/insulin-like growth factor signaling (IIS) and 20E signaling were also upregulated after JHA application. Taken together, we suggest that JH may mediate the nutrient-dependent IIS pathway by regulating various metabolic pathways and further modulate 20E signaling.
As the more recent next-generation sequencing (NGS) technologies provide longer read sequences, the use of sequencing datasets for complete haplotype phasing is fast becoming a reality, allowing haplotype reconstruction of a single sequenced genome. Nearly all previous haplotype reconstruction studies have focused on diploid genomes and are rarely scalable to genomes with higher ploidy. Yet computational investigations into polyploid genomes carry great importance, impacting plant, yeast and fish genomics, as well as the studies of the evolution of modern-day eukaryotes and (epi)genetic interactions between copies of genes. In this paper, we describe a novel maximum-likelihood estimation framework, HapTree, for polyploid haplotype assembly of an individual genome using NGS read datasets. We evaluate the performance of HapTree on simulated polyploid sequencing read data modeled after Illumina sequencing technologies. For triploid and higher ploidy genomes, we demonstrate that HapTree substantially improves haplotype assembly accuracy and efficiency over the state-of-the-art; moreover, HapTree is the first scalable polyplotyping method for higher ploidy. As a proof of concept, we also test our method on real sequencing data from NA12878 (1000 Genomes Project) and evaluate the quality of assembled haplotypes with respect to trio-based diplotype annotation as the ground truth. The results indicate that HapTree significantly improves the switch accuracy within phased haplotype blocks as compared to existing haplotype assembly methods, while producing comparable minimum error correction (MEC) values. A summary of this paper appears in the proceedings of the RECOMB 2014 conference, April 2–5.
While human and other eukaryotic genomes typically contain two copies of every chromosome, plants, yeast and fish such as salmon can have strictly more than two copies of each chromosome. By running standard genotype calling tools, it is possible to accurately identify the number of “wild type” and “mutant” alleles (A, C, G, or T) for each single-nucleotide polymorphism (SNP) site. However, in the case of two heterozygous SNP sites, genotype calling tools cannot determine whether “mutant” alleles from different SNP loci are on the same or different chromosomes. While the former would be healthy, in many cases the latter can cause loss of function; it is therefore necessary to identify the phase (the copies of a chromosome on which the mutant alleles occur) in addition to the genotype. This necessitates efficient algorithms to obtain accurate and comprehensive phase information directly from the next-generation-sequencing read data in higher ploidy species. We introduce an efficient statistical method for this task and show that our method significantly outperforms previous ones, in both accuracy and speed, for phasing triploid and higher ploidy genomes. Our method also performs well on human diploid genomes, as demonstrated by our improved phasing of the well-known NA12878 genome (1000 Genomes Project).
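The minimum error correction (MEC) value used above as a comparison metric can be illustrated with a small sketch. It follows the common textbook definition (the fewest allele calls that must be changed in the reads so that every read agrees with some haplotype) and is not HapTree's likelihood model. The function names, the read representation as a mapping from SNP index to allele, and the toy triploid data are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence

# Assumed representation: a read maps SNP index -> observed allele (0 or 1).
Read = Dict[int, int]


def read_distance(read: Read, haplotype: Sequence[int]) -> int:
    """Count the SNP sites covered by the read that disagree with the haplotype."""
    return sum(1 for pos, allele in read.items() if haplotype[pos] != allele)


def mec_score(reads: List[Read], haplotypes: List[Sequence[int]]) -> int:
    """Sum, over all reads, the mismatches to the best-matching haplotype."""
    return sum(min(read_distance(r, h) for h in haplotypes) for r in reads)


# Hypothetical triploid phasing over four SNP sites.
haplotypes = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 1, 0, 1]]
reads = [
    {0: 0, 1: 0},        # consistent with haplotype 0
    {1: 1, 2: 0, 3: 0},  # consistent with haplotype 1
    {2: 0, 3: 0},        # consistent with haplotype 1
    {0: 0, 1: 1, 2: 1},  # one mismatch to its closest haplotype
]
print(mec_score(reads, haplotypes))
```

A lower MEC means the reads need fewer corrections to fit the proposed phasing; it says nothing about which haplotype is biologically correct, which is why switch accuracy is reported alongside it.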
High-throughput experimental technologies are generating increasingly massive and complex genomic data sets. The sheer enormity and heterogeneity of these data threaten to make the arising problems computationally infeasible. Fortunately, powerful algorithmic techniques lead to software that can answer important biomedical questions in practice. In this Review, we sample the algorithmic landscape, focusing on state-of-the-art techniques, the understanding of which will aid the bench biologist in analysing omics data. We spotlight specific examples that have facilitated and enriched analyses of sequence, transcriptomic and network data sets.
This paper presents a fault diagnosis method for key satellite components, called the Anomaly Monitoring Method (AMM), which combines state estimation based on Multivariate State Estimation Techniques (MSET) with anomaly detection based on the Sequential Probability Ratio Test (SPRT). Based on a failure analysis of lithium-ion batteries (LIBs), we divided LIB failures into internal failure, external failure, and thermal runaway and selected the electrolyte resistance (R_e) and the charge transfer resistance (R_ct) as the key parameters for state estimation. Then, from actual in-orbit telemetry data for these key parameters, we obtained the actual residual value (R_X) and the healthy residual value (R_L) of LIBs using MSET state estimation, and from these residual values (R_X and R_L) we detected anomalous states using SPRT anomaly detection. Lastly, we applied AMM to LIBs in an example and validated its feasibility and effectiveness by comparing its results with those of the threshold detection method (TDM).
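The SPRT step described above can be sketched as follows. This is a minimal illustration of Wald's sequential probability ratio test for a mean shift, assuming Gaussian residuals with known standard deviation; it is not the authors' implementation, and the MSET estimator that would produce the residuals is not shown. The function name, the parameters `shift`, `sigma`, `alpha`, and `beta`, and the restart-after-decision policy are assumptions.

```python
import math
from typing import Iterable, List


def sprt_mean_shift(
    residuals: Iterable[float],
    shift: float,
    sigma: float,
    alpha: float = 0.01,  # assumed false-alarm probability
    beta: float = 0.01,   # assumed missed-alarm probability
) -> List[str]:
    """Test H0 (residual mean 0) against H1 (residual mean `shift`).

    Returns one label per sample: "anomaly", "normal", or "undecided".
    The log-likelihood ratio is reset after each decision.
    """
    upper = math.log((1 - beta) / alpha)   # accept H1 at or above this bound
    lower = math.log(beta / (1 - alpha))   # accept H0 at or below this bound
    llr = 0.0
    decisions = []
    for x in residuals:
        # Gaussian log-likelihood ratio increment for a mean shift.
        llr += (shift / sigma**2) * (x - shift / 2)
        if llr >= upper:
            decisions.append("anomaly")
            llr = 0.0
        elif llr <= lower:
            decisions.append("normal")
            llr = 0.0
        else:
            decisions.append("undecided")
    return decisions


healthy = sprt_mean_shift([0.0] * 10, shift=1.0, sigma=1.0)
faulty = sprt_mean_shift([1.0] * 10, shift=1.0, sigma=1.0)
print(healthy[-1], faulty[-1])
```

Unlike a fixed threshold on each sample, the test accumulates evidence across samples, so a small but persistent residual drift can trigger a decision even when no single value is extreme.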
Intramuscular fat (IMF) is an important trait influencing meat quality, and preadipocyte differentiation is a key factor affecting IMF deposition. Here we compared the transcriptome profiles of porcine intramuscular and subcutaneous preadipocytes during differentiation to gain insight into specific molecular and cellular events associated with intramuscular stromal vascular cell (MSVC) differentiation. RNA-Seq was used to screen for differentially expressed genes (DEGs) during the in vitro differentiation of MSVC and subcutaneous stromal vascular cell (ASVC) on days 0, 2 and 4. A total of 985 DEGs were identified during ASVC differentiation and 1469 DEGs during MSVC differentiation. Among these DEGs, 409 genes were specifically expressed during ASVC differentiation, 893 genes were specifically expressed during MSVC differentiation, and 576 DEGs were co-expressed during ASVC and MSVC differentiation. The expression profiles of DEGs during ASVC or MSVC differentiation were determined by cluster analysis based on Short Time-series Expression Miner (STEM). Four significant STEM profiles (profiles 1, 4, 5, and 14) were determined during ASVC differentiation, and four significant STEM profiles (profiles 1, 4, 11, and 14) were determined during MSVC differentiation. Gene ontology (GO) analysis indicated that DEGs related to adipocyte differentiation were identified to be significantly enriched in both adipose and muscle profile 14. In addition, Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis of DEGs in adipose profile 14 and muscle profiles 11 and 14 (STEM clustered them into one cluster) showed that the PPAR signaling pathway was significantly enriched in these profiles and four signaling pathways were specifically enriched in muscle profiles 11 and 14. 
Furthermore, analysis of transcription factor binding sites (TFBS) in the gene set revealed two over-represented transcription factors (NR3C4 and NR3C1), which were specifically significantly enriched in the promoter regions of genes within muscle gene expression profiles 11 and 14.
Cytotoxic-T-lymphocyte (CTL) escape mutations undermine the durability of effective human immunodeficiency virus type 1 (HIV-1)-specific CD8+ T cell responses. The rate of CTL escape from a given response is largely governed by the net of all escape-associated viral fitness costs and benefits. The observation that CTL escape mutations can carry an associated fitness cost in terms of reduced virus replication capacity (RC) suggests a fitness cost-benefit trade-off that could delay CTL escape and thereby prolong CD8 response effectiveness. However, our understanding of this potential fitness trade-off is limited by the small number of CTL escape mutations for which a fitness cost has been quantified. Here, we quantified the fitness cost of the 29 most common HIV-1B Gag CTL escape mutations using an in vitro RC assay. The majority (20/29) of mutations reduced RC by more than the benchmark M184V antiretroviral drug resistance mutation, with impacts ranging from 8% to 69%. Notably, the reduction in RC was significantly greater for CTL escape mutations associated with protective HLA class I alleles than for those associated with nonprotective alleles. To speed the future evaluation of CTL escape costs, we also developed an in silico approach for inferring the relative impact of a mutation on RC based on its computed impact on protein thermodynamic stability. These data illustrate that the magnitude of CTL escape-associated fitness costs, and thus the barrier to CTL escape, varies widely even in the conserved Gag proteins and suggest that differential escape costs may contribute to the relative efficacy of CD8 responses.
The study was conducted to investigate whether dietary fish oil could influence the growth of piglets by regulating the expression of proinflammatory cytokines. A split-plot experimental design was used, with sow diet effect in the main plots and piglet diet effect in the subplots. The results showed that suckling piglets from fish oil fed dams grew more rapidly (P < 0.05) than controls. These piglets also had higher ADG, feed intake, and final body weight (P < 0.05) during the postweaning period than piglets from lard fed dams. Furthermore, there was a significant decrease (P < 0.01) in the expression of interleukin 6 and tumor necrosis factor-α in the longissimus dorsi muscle. In contrast, there was a tendency (P < 0.10) towards lower ADG and higher feed:gain in weaned piglets receiving fish oil compared with those receiving lard. Meanwhile, splenic proinflammatory cytokine expression was increased (P < 0.01) in piglets receiving fish oil during the postweaning period. The results suggested that adding 7% fish oil to sows' diets alleviated the inflammatory response by decreasing proinflammatory cytokine expression in skeletal muscle and accelerated piglet growth. However, adding 7% fish oil to weaned piglets' diets might decrease piglet growth by increasing splenic proinflammatory cytokine expression.
Motivation: The exponential growth of protein sequence databases has
increasingly made the fundamental question of searching for homologs a computational
bottleneck. The amount of unique data, however, is not growing nearly as fast; we can
exploit this fact to greatly accelerate homology search. Acceleration of programs in the
popular PSI/DELTA-BLAST family of tools will not only speed up homology search directly
but also the huge collection of other current programs that primarily interact with large
protein databases via precisely these tools.
Results: We introduce a suite of homology search tools, powered by
compressively accelerated protein BLAST (CaBLASTP), which are significantly faster than
and comparably accurate with all known state-of-the-art tools, including HHblits,
DELTA-BLAST and PSI-BLAST. Further, our tools are implemented in a manner that allows
direct substitution into existing analysis pipelines. The key idea is that we introduce a
local similarity-based compression scheme that allows us to operate directly on the
compressed data. Importantly, CaBLASTP’s runtime scales almost linearly in the
amount of unique data, as opposed to current BLASTP variants, which scale linearly in the
size of the full protein database being searched. Our compressive algorithms will speed up
many tasks, such as protein structure prediction and orthology mapping, which rely heavily
on homology search.
Availability: CaBLASTP is available under the GNU Public License at
Hepatitis B virus (HBV) infection remains a global problem, despite the effectiveness of the hepatitis B vaccine in preventing infection. The resolution of HBV infection is believed to be attributable to virus-specific immunity. Direct in vivo evaluation of anti-HBV immunity in the liver is currently not possible. We have developed a new assay system that detects HBV clearance in the liver after the hydrodynamic transfer of a reporter gene and over-length, linear HBV DNA into hepatocytes, followed by bioluminescence imaging of the reporter gene (Fluc). We employed bioluminescence detection of luciferase expression in HBV-infected hepatocytes to measure the hepatitis B core antigen (HBcAg)-specific immune responses directed against these infected hepatocytes. Only HBcAg-immunized animals, and not mock-treated animals, showed decreased luciferase expression, HBsAg and viral DNA in the liver at day 28 after hydrodynamic infection with over-length HBV DNA, indicating that control of luciferase expression correlates with viral clearance from infected hepatocytes.
Protein structure alignment is a fundamental problem in computational structural biology. Many programs have been developed for automatic protein structure alignment, but most of them align two protein structures purely on the basis of geometric similarity, without considering evolutionary and functional relationships. As such, these programs may generate structure alignments that are not very biologically meaningful from the evolutionary perspective. This paper presents a novel method, DeepAlign, for automatic pairwise protein structure alignment. DeepAlign aligns two protein structures using not only spatial proximity of equivalent residues (after rigid-body superposition), but also evolutionary relationship and hydrogen-bonding similarity. Experimental results show that DeepAlign can generate structure alignments much more consistent with manually curated alignments than other automatic tools, especially when the proteins under consideration are remote homologs. These results imply that, in addition to geometric similarity, evolutionary information and hydrogen-bonding similarity are essential to aligning two protein structures.
Despite significant progress in recent years, ab initio folding is still one of the most challenging problems in structural biology. This paper presents a probabilistic graphical model for ab initio folding, which employs Conditional Random Fields (CRFs) and directional statistics to model the relationship between the primary sequence of a protein and its three-dimensional structure. Unlike the widely used fragment assembly method and the lattice model for protein folding, our graphical model can explore protein conformations in a continuous space according to their probability. The probability of a protein conformation reflects its stability and is estimated from the PSI-BLAST sequence profile and predicted secondary structure. Experimental results indicate that this new method compares favorably with the fragment assembly method and the lattice model.
protein structure prediction; ab initio folding; conditional random fields (CRFs); directional statistics; fragment assembly; lattice model
Coat color in dog breeds is an excellent character for revealing the power of artificial selection, as it is extremely diverse and likely the result of recent domestication. Coat color is generated by melanocytes, which synthesize pheomelanin (a red or yellow pigment) or eumelanin (a black or brown pigment) through the pigment type-switching pathway, and is regulated by three genes in dogs: MC1R (melanocortin receptor 1), CBD103 (β-defensin 103), and ASIP (agouti-signaling protein precursor). The genotypes of these three gene loci in dog breeds are associated with coat color pattern. Here, we resequenced these three gene loci in two Kunming dog populations and analyzed these sequences using population genetic approaches to identify evolutionary patterns that have occurred at these loci during the recent domestication and breeding of the Kunming dog. The analysis showed that MC1R undergoes balancing selection in both Kunming dog populations, and that the Fst value for MC1R indicates significant genetic differentiation across the two populations. In contrast, similar results were not observed for CBD103 or ASIP. These results suggest that high heterozygosity and allelic differences at the MC1R locus may explain both the mixed color coat, of yellow and black, and the difference in coat colors in both Kunming dog populations.
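To make the Fst statistic mentioned above concrete, here is a minimal sketch of one textbook form, Fst = (H_T - H_S) / H_T, for a single biallelic site in two populations. The abstract does not say which estimator was used, so this is an assumption for illustration only; it also assumes equal population sizes and Hardy-Weinberg expected heterozygosity. The function names and example frequencies are hypothetical.

```python
def expected_heterozygosity(p: float) -> float:
    """Expected heterozygosity 2p(1-p) at a biallelic site with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def fst_two_populations(p1: float, p2: float) -> float:
    """Fst = (H_T - H_S) / H_T for two equally weighted populations.

    p1 and p2 are the frequencies of the same allele in each population.
    Returns 0.0 when the site is monomorphic overall (H_T == 0).
    """
    p_bar = (p1 + p2) / 2.0
    h_total = expected_heterozygosity(p_bar)
    if h_total == 0.0:
        return 0.0
    h_within = (expected_heterozygosity(p1) + expected_heterozygosity(p2)) / 2.0
    return (h_total - h_within) / h_total


# Hypothetical allele frequencies in two populations.
print(fst_two_populations(0.5, 0.5))  # identical frequencies
print(fst_two_populations(0.8, 0.2))  # moderately differentiated
print(fst_two_populations(1.0, 0.0))  # fixed for different alleles
```

Values near 0 indicate that the two populations share similar allele frequencies, while values approaching 1 indicate strong differentiation at the site.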
Small molecules that can specifically bind to a DNA abasic site (AP site) have received much attention due to their importance in DNA lesion identification, drug discovery, and sensor design. Herein, the AP site binding behavior of sanguinarine (SG), a natural alkaloid, was investigated. In aqueous solution, SG has a short-wavelength alkanolamine emission band and a long-wavelength iminium emission band. At pH 8.3, both SG bands are quenched upon binding to fully matched DNAs without the AP site. In contrast, the presence of the AP site induces strong SG binding, and the observed fluorescence enhancement of the iminium band is highly dependent on the nucleobases flanking the AP site, while the alkanolamine band is always quenched. The bases opposite the AP site also modify SG's emission behavior to some extent. It was found that the observed quenching for DNAs with Gs and Cs flanking the AP site is most likely caused by electron transfer between the AP site-bound excited-state SG and the nearby Gs. However, flanking As and Ts, which are not easily oxidized, favor the enhanced emission. This AP site-selective enhancement of SG fluorescence is accompanied by a conversion of the dominant emission from the alkanolamine band to the iminium band, and thus by a large emission shift of about 170 nm. Absorption spectra, steady-state and transient-state fluorescence, DNA melting, and electrolyte experiments confirm that SG binds to the AP site and that the stacking interaction with the nearby base pairs is likely to prevent the converted SG iminium form from contacting water, so that it is emissive when the AP site neighbors are bases other than guanines. We expect that this fluorophore could be developed as a promising AP site binder with a large emission shift.
This paper presents RaptorX, a statistical method for template-based protein modeling that improves alignment accuracy by exploiting structural information in a single or multiple templates. RaptorX consists of three major components: single-template threading, alignment quality prediction and multiple-template threading. This paper summarizes the methods employed by RaptorX and presents its CASP9 result analysis, aiming to identify major bottlenecks in RaptorX and template-based modeling and to suggest directions for further study. Our results show that template structural information substantially helps both single-template and multiple-template protein threading, especially when closely related templates are unavailable, and that there is still much room for improvement in both alignment and template selection. The RaptorX web server is available at http://raptorx.uchicago.edu.
single-template threading; multiple-template threading; alignment quality prediction; probabilistic alignment; multiple protein alignment; CASP
The MyD88-independent pathway, one of the two crucial TLR signaling routes, is thought to be a vertebrate innovation. However, a novel Toll/interleukin-1 receptor (TIR) adaptor, designated bbtTICAM, which was identified in the basal chordate amphioxus, links this pathway to invertebrates. The protein architecture of bbtTICAM is similar to that of vertebrate TICAM1 (TIR-containing adaptor molecule-1, also known as TRIF), while phylogenetic analysis based on the TIR domain indicated that bbtTICAM is the oldest ortholog of vertebrate TICAM1 and TICAM2 (TIR-containing adaptor molecule-2, also known as TRAM). Similar to human TICAM1, bbtTICAM activates NF-κB in a MyD88-independent manner by interacting with receptor interacting protein (RIP) via its RHIM motif. Such activation requires bbtTICAM to form homodimers in endosomes, and it may be negatively regulated by amphioxus SARM (sterile α and armadillo motif-containing protein) and TRAF2. However, bbtTICAM did not induce the production of type I interferon. Thus, our study not only presents the ancestral features of vertebrate TICAM1 and TICAM2, but also reveals the evolutionary origin of the MyD88-independent pathway from basal chordates, which will aid in understanding the development of the vertebrate TLR network.
TLR; TICAM; MyD88-independent pathway; innate immunity; evolution
Motivation: Building an accurate alignment of a large set of distantly related protein structures is still very challenging.
Results: This article presents a novel method, 3DCOMB, that can generate a multiple structure alignment (MSA) with not only as many conserved cores as possible, but also high-quality pairwise alignments. 3DCOMB is unique in that it makes use of both local and global structure environments, combined by a statistical learning method, to accurately identify highly similar fragment blocks (HSFBs) among all proteins to be aligned. By extending the alignments of these HSFBs, 3DCOMB can quickly generate an accurate MSA without using progressive alignment. 3DCOMB significantly outperforms other tools in aligning distantly related proteins. 3DCOMB can also generate correct alignments for functionally similar regions among proteins of very different structures, where many other MSA tools fail. 3DCOMB is useful for many real-world applications. In particular, it reveals that there is still much room for improvement in multiple-template homology modeling, which several other MSA tools fail to show.
Availability: 3DCOMB is available at http://ttic.uchicago.edu/~jinbo/software.htm.
Supplementary Information: Supplementary data are available at Bioinformatics online.
Improving the quality and coverage of the protein interactome is of paramount importance for biomedical research, particularly given the various sources of uncertainty in high-throughput techniques. We introduce a structure-based framework, Coev2Net, for computing a single confidence score that addresses both false-positive and false-negative rates. Coev2Net is easily applied to thousands of binary protein interactions and has superior predictive performance over existing methods. We experimentally validate selected high-confidence predictions in the human MAPK network and show that predicted interfaces are enriched for cancer-related or damaging SNPs. Coev2Net can be downloaded at http://struct2net.csail.mit.edu.
Background & Aims
Cholestasis contributes to hepatocellular injury and promotes liver carcinogenesis. We created a mouse model of chronic cholestasis to study its effects on progression of cholangiocarcinoma and the oncogenes involved.
To induce chronic cholestasis, Balb/c mice were given 2 weekly intraperitoneal injections of diethylnitrosamine (DEN); 2 weeks later, some mice also received left and median bile duct ligation (LMBDL) and then, 1 week later, were fed DEN in corn oil weekly by oral gavage (DLD). Liver samples were analyzed by immunohistochemical and biochemical assays; expression of Mnt and c-Myc was reduced by injection of small inhibitor RNAs.
Chronic cholestasis was induced by DLD and accelerated progression of cholangiocarcinoma, compared with mice given only DEN. Cystic hyperplasias, cystic atypical hyperplasias, cholangiomas, and cholangiocarcinoma developed in the DLD group at weeks 8, 12, 16 and 28, respectively. LMBDL repressed expression of microRNA (miR)-34a and Let-7a, upregulating Lin-28B, HIF-1α, HIF-2α, and miR-210. Upregulation of Lin-28B might inhibit let-7a, which is associated with development of cystic hyperplasias, cystic atypical hyperplasias, cholangiomas, and cholangiocarcinoma. Knockdown of c-Myc reduced progression of cholangiocarcinoma whereas knockdown of Mnt accelerated its progression. Downregulation of miR-34a expression might upregulate c-Myc. The upregulation of miR-210 via HIF-2α was involved in downregulation of Mnt. Activation of the miR-34a–c-Myc and HIF-2α–miR-210–Mnt pathways caused c-Myc to bind the E-box element of cyclin D1, instead of Mnt, resulting in cyclin D1 upregulation.
DLD induction of chronic cholestasis accelerated progression of cholangiocarcinoma, which was mediated by downregulation of miR-34a, upregulation of miR-210, and replacement of Mnt by c-Myc in binding to cyclin D1.
Liver disease; transcriptional regulation; cell cycle; carcinogenesis; DNA binding
Motivation: Alignment errors are still the main bottleneck for current template-based protein modeling (TM) methods, including protein threading and homology modeling, especially when the sequence identity between two proteins under consideration is low (<30%).
Results: We present a novel protein threading method, CNFpred, which achieves much more accurate sequence–template alignment by employing a probabilistic graphical model called a Conditional Neural Field (CNF), which aligns one protein sequence to its remote template using a non-linear scoring function. This scoring function accounts for correlation among a variety of protein sequence and structure features, makes use of information in the neighborhood of two residues to be aligned, and is thus much more sensitive than the widely used linear or profile-based scoring function. To train this CNF threading model, we employ a novel quality-sensitive method, instead of the standard maximum-likelihood method, to maximize directly the expected quality of the training set. Experimental results show that CNFpred generates significantly better alignments than the best profile-based and threading methods on several public (but small) benchmarks as well as our own large dataset. CNFpred outperforms others regardless of the lengths or classes of proteins, and works particularly well for proteins with sparse sequence profiles due to the effective utilization of structure information. Our methodology can also be adapted to protein sequence alignment.
Supplementary data are available at Bioinformatics online.
Most threading methods predict the structure of a protein using only a single template. Owing to the increasing number of solved structures, a protein without a solved structure is very likely to have more than one similar template structure. Therefore, a natural question to ask is whether we can improve modeling accuracy using multiple templates. This paper describes a new multiple-template threading method to answer this question. At the heart of this multiple-template threading method is a novel probabilistic-consistency algorithm that can accurately align a single protein sequence simultaneously to multiple templates. Experimental results indicate that our multiple-template method can improve pairwise sequence-template alignment accuracy and generate models of better quality than single-template models, even if the latter are built from the best single templates (P-value < 10^-6), while many popular multiple sequence/structure alignment tools fail to do so. The underlying reason is that our probabilistic-consistency algorithm can generate accurate multiple sequence/template alignments. In other words, without an accurate multiple sequence/template alignment, modeling accuracy cannot be improved by simply using multiple templates to increase alignment coverage. Blindly tested on the CASP9 targets with more than one good template structure, our method outperforms all other CASP9 servers except two (Zhang-Server and QUARK of the same group). Our probabilistic-consistency algorithm can possibly be extended to align multiple protein/RNA sequences and structures.
protein modeling; multiple-template threading; probabilistic alignment matrix; probabilistic-consistency algorithm; multiple sequence/template alignment
Compared with protein 3-class secondary structure (SS) prediction, 8-class prediction has received less attention and is also much more challenging, especially for proteins with few sequence homologs. This paper presents a new probabilistic method for 8-class SS prediction using conditional neural fields (CNFs), a recently invented probabilistic graphical model. This CNF method not only models the complex relationship between sequence features and SS, but also exploits the interdependency among SS types of adjacent residues. In addition to sequence profiles, our method also makes use of non-evolutionary information for SS prediction. Tested on the CB513 and RS126 data sets, our method achieves Q8 accuracy of 64.9% and 64.7%, respectively, which is much better than that of the SSpro8 web server (51.0% and 48.0%, respectively). Our method can also be used to predict other structure properties (e.g. solvent accessibility) of a protein or the SS of RNA.
Bioinformatics; Conditional neural fields; Eight class; Protein; Secondary structure prediction
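For readers unfamiliar with the Q8 metric reported above, the following sketch computes it as the percentage of residues whose predicted 8-class label matches the observed label. It assumes both structures are given as equal-length strings of one-letter state codes; the function name and the toy strings (which use "L" for the loop state) are hypothetical, and the sketch does not reproduce the evaluation protocol of the paper.

```python
def q8_accuracy(predicted: str, observed: str) -> float:
    """Percentage of positions where the predicted 8-class label equals the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have the same length")
    if not observed:
        raise ValueError("cannot score an empty structure string")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


# Toy example: one of eight residues is mislabeled (H predicted, G observed).
print(q8_accuracy("HHHHEEEL", "HHHGEEEL"))
```

Because eight classes are distinguished rather than three, confusing closely related states (such as two helix types) counts as an error here, which is one reason Q8 values are typically lower than 3-class accuracy.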
Protein threading is one of the most successful protein structure prediction methods. Most protein threading methods use a scoring function that linearly combines sequence and structure features to measure the quality of a sequence-template alignment, so that a dynamic programming algorithm can be used to optimize the scoring function. However, a linear scoring function cannot fully exploit interdependency among features and thus limits alignment accuracy.
This paper presents a nonlinear scoring function for protein threading, which not only can model interactions among different protein features, but also can be efficiently optimized using a dynamic programming algorithm. We achieve this by modeling the threading problem using a probabilistic graphical model, Conditional Random Fields (CRFs), and training the model using the gradient tree boosting algorithm. The resultant model is a nonlinear scoring function consisting of a collection of regression trees. Each regression tree models a type of nonlinear relationship among sequence and structure features. Experimental results indicate that this new threading model can effectively leverage weak biological signals and greatly improve both alignment accuracy and fold recognition rate.
protein threading; conditional random fields; gradient tree boosting; regression tree; nonlinear scoring function
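The point that a nonlinear scoring function can still be optimized by dynamic programming can be sketched as follows: as long as the score of an alignment decomposes into per-position terms, the pair score itself may be any function of the two aligned positions. The code below is a plain global alignment (Needleman-Wunsch style) with a pluggable pair score, not the CRF or gradient tree boosting model of the paper. The hand-written `toy_tree_score`, the hydrophobic residue set, and the linear gap penalty are all assumptions chosen for illustration.

```python
from typing import Callable, List

# Assumed toy feature: membership in a hydrophobic residue set.
HYDROPHOBIC = set("AVILMFWC")


def toy_tree_score(a: str, b: str) -> float:
    """Hypothetical two-level decision tree: the effect of hydrophobicity
    depends on whether the residues are identical (a feature interaction)."""
    both_hydrophobic = a in HYDROPHOBIC and b in HYDROPHOBIC
    if a == b:
        return 3.0 if both_hydrophobic else 2.0
    return 0.5 if both_hydrophobic else -1.0


def best_alignment_score(
    seq: str,
    template: str,
    pair_score: Callable[[str, str], float],
    gap: float = -1.0,  # assumed linear gap penalty
) -> float:
    """Maximum global alignment score under an arbitrary per-pair score."""
    n, m = len(seq), len(template)
    dp: List[List[float]] = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + pair_score(seq[i - 1], template[j - 1]),  # align pair
                dp[i - 1][j] + gap,  # gap in template
                dp[i][j - 1] + gap,  # gap in sequence
            )
    return dp[n][m]


print(best_alignment_score("AK", "AVK", toy_tree_score))
```

The recurrence never inspects how `pair_score` is computed, so replacing the toy tree with a sum of learned regression trees would leave the optimization unchanged.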