Cancer cells derived from different stages of tumor progression may exhibit distinct biological properties, as exemplified by the paired lung cancer cell lines H1993 and H2073. While H1993 was derived from a chemo-naive metastasized tumor, H2073 originated from the chemo-resistant primary tumor of the same patient and exhibits a strikingly different drug response profile. To understand the underlying genetic and epigenetic bases for their biological properties, we investigated these cells using a wide range of large-scale methods including whole genome sequencing, RNA sequencing, SNP array, DNA methylation array, and de novo genome assembly. We conducted an integrative analysis of both cell lines to distinguish between potential driver and passenger alterations. Although many genes are mutated in these cell lines, the combination of DNA- and RNA-based variant information strongly implicates a small number of genes including TP53 and STK11 as likely drivers. Likewise, we found a diverse set of genes differentially expressed between these cell lines, but only a fraction can be attributed to changes in DNA copy number or methylation. This set included the ABC transporter ABCC4, implicated in drug resistance, and the metastasis-associated MET oncogene. While the rich data content allowed us to reduce the space of hypotheses that could explain most of the observed biological properties, we also caution that such single-patient case studies lack statistical power and have inherent limitations.
Hepatocellular carcinoma (HCC) is a heterogeneous disease with a high mortality rate. Recent genomic studies have identified TP53, AXIN1, and CTNNB1 as the most frequently mutated genes. Lower frequency mutations have been reported in ARID1A, ARID2 and JAK1. In addition, hepatitis B virus (HBV) integrations into the human genome have been associated with HCC.
Here, we deep-sequence 42 HCC patients with a combination of whole genome, exome and transcriptome sequencing to identify the mutational landscape of HCC using a reasonably large discovery cohort. We find frequent mutations in TP53, CTNNB1 and AXIN1, and rare but likely functional mutations in BAP1 and IDH1. Besides frequent hepatitis B virus integrations at TERT, we identify translocations at the boundaries of TERT. A novel deletion is identified in CTNNB1 in a region that is heavily mutated in multiple cancers. We also find multiple high-allelic frequency mutations in the extracellular matrix protein LAMA2. Lower expression levels of LAMA2 correlate with a proliferative signature, and predict poor survival and higher chance of cancer recurrence in HCC patients, suggesting an important role of the extracellular matrix and cell adhesion in tumor progression of a subgroup of HCC patients.
The heterogeneous disease of HCC features diverse modes of genomic alteration. In addition to common point mutations, structural variations and methylation changes, there are several virus-associated changes, including gene disruption or activation, formation of chimeric viral-human transcripts, and DNA copy number changes. Such a multitude of genomic events likely contributes to the heterogeneous nature of HCC.
Electronic supplementary material
The online version of this article (doi:10.1186/s13059-014-0436-9) contains supplementary material, which is available to authorized users.
Allele-specific gene expression (ASE) is an important aspect of gene regulation. We developed a novel method, MBASED (meta-analysis based allele-specific expression detection), for ASE detection using RNA-seq data. MBASED aggregates information across multiple single nucleotide variation loci to obtain a gene-level measure of ASE, even when prior phasing information is unavailable. MBASED is capable of one-sample and two-sample analyses and performs well in simulations. We applied MBASED to a panel of cancer cell lines and paired tumor-normal tissue samples, and observed extensive ASE in cancer, but not normal, samples, mainly driven by genomic copy number alterations.
Electronic supplementary material
The online version of this article (doi:10.1186/s13059-014-0405-3) contains supplementary material, which is available to authorized users.
Receptor interacting protein kinase 4 (RIPK4) is required for epidermal differentiation (1–4) and is mutated in Bartsocas-Papas syndrome (5, 6). While RIPK4 binds protein kinase C (5, 6), RIPK4 signaling mechanisms are largely unknown. We show that ectopic RIPK4 induces cytosolic β-catenin accumulation and a transcriptional program similar to Wnt3a, whereas kinase-defective or Bartsocas-Papas syndrome RIPK4 mutants do not. Ectopic ripk4 synergized with Wnt family member xwnt8 in Xenopus, whereas ripk4 morpholinos or kinase-defective RIPK4 antagonized Wnt signaling. Mechanistically, RIPK4 interacted constitutively with the Wnt adaptor protein DVL2 and, after Wnt3a stimulation, with the co-receptor LRP6. Phosphorylation of DVL2 at Ser298 and Ser480 by RIPK4 favored canonical Wnt signaling. Growth of a Wnt-dependent N-Tera2 xenograft tumor model was suppressed by RIPK4 knockdown, suggesting that RIPK4 overexpression may contribute to the growth of certain tumor types.
LRP6; β-catenin; Xenopus
HIV-1 infection enhances HCV replication and as a consequence accelerates HCV-mediated hepatocellular carcinoma (HCC). However, the precise molecular mechanism by which this takes place is currently unknown. Our data showed that infectious HIV-1 failed to replicate in human hepatocytic cell lines. No discernible virus replication was observed, even when the cell lines transfected with HIV-1 proviral DNA were co-cultured with Jurkat T cells, indicating that the problem of liver deterioration in the co-infected patient is not due to the replication of HIV-1 in the hepatocytes of the HCV infected host. Instead, HIV-1 Nef protein was transferred from nef-expressing T cells to hepatocytic cells through conduits, wherein up to 16% (average 10%) of the cells harbored the transferred Nef, when the hepatocytic cells were co-cultured with nef-expressing Jurkat cells for 24 h. Further, Nef altered the size and numbers of lipid droplets (LD), and consistently up-regulated HCV replication by 1.5∼2.5 fold in the target subgenomic replicon cells, which is remarkable in relation to the initially indolent viral replication. Nef also dramatically augmented reactive oxygen species (ROS) production and enhanced ethanol-mediated up-regulation of HCV replication so as to accelerate HCC. Taken together, these data indicate that HIV-1 Nef is a critical element in accelerating progression of liver pathogenesis via enhancing HCV replication and coordinating modulation of key intra- and extra-cellular molecules for liver decay.
Loss of Asxl1 results in myelodysplastic syndrome, whereas concomitant deletion of Tet2 restores HSC self-renewal and triggers a more severe disease phenotype distinct from that seen in single-gene knockout mice.
Somatic Additional Sex Combs Like 1 (ASXL1) mutations occur in 10–30% of patients with myeloid malignancies, most commonly in myelodysplastic syndromes (MDSs), and are associated with adverse outcome. Germline ASXL1 mutations occur in patients with Bohring-Opitz syndrome. Here, we show that constitutive loss of Asxl1 results in developmental abnormalities, including anophthalmia, microcephaly, cleft palates, and mandibular malformations. In contrast, hematopoietic-specific deletion of Asxl1 results in progressive, multilineage cytopenias and dysplasia in the context of increased numbers of hematopoietic stem/progenitor cells, characteristic features of human MDS. Serial transplantation of Asxl1-null hematopoietic cells results in a lethal myeloid disorder at a shorter latency than primary Asxl1 knockout (KO) mice. Asxl1 deletion reduces hematopoietic stem cell self-renewal, which is restored by concomitant deletion of Tet2, a gene commonly co-mutated with ASXL1 in MDS patients. Moreover, compound Asxl1/Tet2 deletion results in an MDS phenotype with hastened death compared with single-gene KO mice. Asxl1 loss results in a global reduction of H3K27 trimethylation and dysregulated expression of known regulators of hematopoiesis. RNA-Seq/ChIP-Seq analyses of Asxl1 in hematopoietic cells identify a subset of differentially expressed genes as direct targets of Asxl1. These findings underscore the importance of Asxl1 in Polycomb group function, development, and hematopoiesis.
The precedence effect is a prerequisite for faithful sound localization in a complex auditory environment, and is a physiological phenomenon in which the auditory system selectively suppresses the directional information from echoes. Here we investigated how neurons in the inferior colliculus respond to the paired sounds that produce precedence-effect illusions, and whether their firing behavior can be modulated through inhibition with gamma-aminobutyric acid (GABA). We recorded extracellularly from 36 neurons in rat inferior colliculus under three conditions: no injection, injection with saline, and injection with gamma-aminobutyric acid. The paired sounds that produced precedence effects were two identical 4-ms noise bursts, which were delivered contralaterally or ipsilaterally to the recording site. The normalized neural responses were measured as a function of different inter-stimulus delays and half-maximal interstimulus delays were acquired. Neuronal responses to the lagging sounds were weak when the inter-stimulus delay was short, but increased gradually as the delay was lengthened. Saline injection produced no changes in neural responses, but after local gamma-aminobutyric acid application, responses to the lagging stimulus were suppressed. Application of gamma-aminobutyric acid affected the normalized response to lagging sounds, independently of whether they or the paired sounds were contralateral or ipsilateral to the recording site. These observations suggest that local inhibition by gamma-aminobutyric acid in the rat inferior colliculus shapes the neural responses to lagging sounds, and modulates the precedence effect.
nerve regeneration; precedence effect; auditory center; inferior colliculus; gamma-aminobutyric acid; local inhibition; echo suppression; lagging stimulus; NSFC grant; neural regeneration
Neither HBV DNA nor HBsAg positivity at birth is an accurate marker for HBV infection of infants. No data are available on the continuous changes of HBV markers in newborns of HBsAg(+) mothers. This prospective, multicenter study aims to observe the dynamic changes of HBV markers and to explore an early diagnostic marker for mother-infant infection.
One hundred forty-eight HBsAg(+) mothers and their newborns were enrolled after the mothers signed informed consent forms. The infants received combination immunoprophylaxis (hepatitis B immunoglobulin [HBIG] and hepatitis B vaccine) at birth, and were then followed up to 12 months. Venous blood of the infants (0, 1, 7, and 12 months of age) was collected to test for HBV DNA and HBV markers.
Of the 148 infants enrolled in our study, 41 and 24 infants were detected as HBsAg(+) and HBV DNA(+) at birth, respectively. Nine were diagnosed with HBV infection after 7 mo follow-up. Dynamic observation of the HBV markers showed that HBV DNA and HBsAg decreased gradually and eventually sero-converted to negativity in the non-infected infants, whereas in the infected infants, HBV DNA and HBsAg were persistently positive, or higher at the end of follow-up. At 1 mo, the infants with anti-HBs(+), despite positivity for HBsAg or HBV DNA at birth, were resolved after 12 mo follow-up, whereas all the nine infants with anti-HBs(−) were diagnosed with HBV infection. Anti-HBs(−) at 1 mo showed a higher positive likelihood ratio for HBV mother-infant infection than HBV DNA and/or HBsAg at birth.
Negativity for anti-HBs at 1 mo can be considered a sensitive and early diagnostic indicator for HBV infection in infants with positive HBV DNA and HBsAg at birth, especially for those infants with low levels of HBV DNA load and HBsAg titer.
Many large-scale studies analyzed high-throughput genomic data to identify altered pathways essential to the development and progression of specific types of cancer. However, no previous study has been extended to provide a comprehensive analysis of pathways disrupted by copy number alterations across different human cancers. Towards this goal, we propose a network-based method to integrate copy number alteration data with human protein-protein interaction networks and pathway databases to identify pathways that are commonly disrupted in many different types of cancer.
We applied our approach to a data set of 2,172 cancer patients across 16 different types of cancers, and discovered a set of commonly disrupted pathways, which are likely essential for tumor formation in the majority of the cancers. We also identified pathways that are only disrupted in specific cancer types, providing molecular markers for different human cancers. Analysis with independent microarray gene expression datasets confirms that the commonly disrupted pathways can be used to identify patient subgroups with significantly different survival outcomes. We also provide a network view of disrupted pathways to explain how copy number alterations affect pathways that regulate cell growth, cell cycle, and differentiation for tumorigenesis.
In this work, we demonstrated that the network-based integrative analysis can help to identify pathways disrupted by copy number alterations across 16 types of human cancers, which are not readily identifiable by conventional overrepresentation-based and other pathway-based methods. All the results and source code are available at http://compbio.cs.umn.edu/NetPathID/.
The ribosome consists of small and large subunits each comprised of dozens of proteins and RNA molecules. However, the functions of many of the individual protomers within the ribosome are still unknown. Here we describe the solution NMR structure of the ribosomal protein RP-L35Ae from the archaeon Pyrococcus furiosus. RP-L35Ae is buried within the large subunit of the ribosome and belongs to Pfam protein domain family PF01247, which is highly conserved in eukaryotes, present in a few archaeal genomes, but absent in bacteria. The protein adopts a six-stranded anti-parallel β-barrel analogous to the ‘tRNA binding motif’ fold. The structure of the P. furiosus RP-L35Ae presented here constitutes the first structural representative from this protein domain family.
ribosomal protein; L35Ae; PF01247; tRNA binding; solution NMR; structural genomics
In this experiment, 97 patients with obstructive sleep apnea hypopnea syndrome were divided into three groups (mild, moderate, severe) according to minimum oxygen saturation, and 35 healthy subjects were examined as controls. Cognitive function was determined using the mismatch negativity paradigm and the Montreal Cognitive Assessment. The results revealed that as the disease worsened, the mismatch negativity latency was gradually extended, and the amplitude gradually declined in patients with obstructive sleep apnea hypopnea syndrome. Importantly, mismatch negativity latency in severe patients with a persistent time of minimum oxygen saturation < 60 seconds was significantly shorter than that with a persistent time of minimum oxygen saturation > 60 seconds. Correlation analysis revealed a negative correlation between minimum oxygen saturation latency and Montreal Cognitive Assessment scores. These findings indicate that intermittent night-time hypoxemia affects mismatch negativity waveforms and Montreal Cognitive Assessment scores. As indicators for detecting the cognitive functional status of obstructive sleep apnea hypopnea syndrome patients, the sensitivity of mismatch negativity is 82.93%, the specificity is 73.33%, the accuracy rate is 81.52%, the positive predictive value is 85.00%, the negative predictive value is 70.21%, the positive likelihood ratio is 3, and the negative likelihood ratio is 0.23. These results indicate that mismatch negativity can be used as an effective tool for diagnosis of cognitive dysfunction in obstructive sleep apnea hypopnea syndrome patients.
obstructive sleep apnea hypopnea syndrome; mismatch negativity; cognitive function; Montreal Cognitive Assessment; latency; diagnosis
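The likelihood ratios reported in the abstract above follow from the stated sensitivity and specificity through the standard definitions LR+ = sensitivity / (1 − specificity) and LR− = (1 − sensitivity) / specificity. The short Python sketch below reproduces that arithmetic; the function name `likelihood_ratios` is illustrative and not taken from the study.

```python
def likelihood_ratios(sensitivity: float, specificity: float) -> tuple[float, float]:
    """Return (LR+, LR-) from sensitivity and specificity given as fractions."""
    if not (0.0 < sensitivity < 1.0 and 0.0 < specificity < 1.0):
        raise ValueError("sensitivity and specificity must lie strictly between 0 and 1")
    lr_pos = sensitivity / (1.0 - specificity)   # how much a positive test raises the odds
    lr_neg = (1.0 - sensitivity) / specificity   # how much a negative test lowers the odds
    return lr_pos, lr_neg


# Values quoted in the abstract: sensitivity 82.93%, specificity 73.33%.
lr_pos, lr_neg = likelihood_ratios(0.8293, 0.7333)
print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.2f}")
```

With the abstract's figures this gives an LR+ of roughly 3.1 and an LR− of roughly 0.23, consistent with the rounded values quoted there.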
The echinoderm microtubule-associated protein-like 4-anaplastic lymphoma kinase (EML4-ALK) fusion gene resulting from an inversion within chromosome 2p occurs in approximately 5% of non-small cell lung cancer (NSCLC) and is mutually exclusive with Ras and EGFR mutations. In this study, we have used a potent and selective ALK small molecule inhibitor, NVP-TAE684, to assess the oncogenic role of EML4-ALK in NSCLC. We show here that TAE684 inhibits proliferation and induces cell cycle arrest, apoptosis, and tumor regression in two NSCLC models that harbor EML4-ALK fusions. TAE684 inhibits EML4-ALK activation and its downstream signaling including ERK, AKT, and STAT3. We used microarray analysis to carry out targeted pathway studies of gene expression changes in the H2228 NSCLC xenograft model after TAE684 treatment and identified a gene signature of EML4-ALK inhibition. The gene signature represents 1210 known human genes, and the top biologic processes represented by these genes are cell cycle, DNA synthesis, cell proliferation, and cell death. We also compared the effect of TAE684 with PF2341066, a c-Met and ALK small molecule inhibitor currently in clinical trial in cancers harboring ALK fusions, and demonstrated that TAE684 is a much more potent inhibitor of EML4-ALK. Our data demonstrate that EML4-ALK plays an important role in the pathogenesis of a subset of NSCLC and provide insight into the mechanism of EML4-ALK inhibition by a small molecule inhibitor.
For cell regulation, E2-like ubiquitin-fold modifier conjugating enzyme 1 (Ufc1) is involved in the transfer of ubiquitin-fold modifier 1 (Ufm1), a ubiquitin-like protein which is activated by the E1-like enzyme Uba5, to various target proteins. Thereby, Ufc1 participates in the very recently discovered Ufm1-Uba5-Ufc1 ubiquitination pathway which is found in metazoan organisms. The structure of human Ufc1 was solved by using both NMR spectroscopy and X-ray crystallography. The complementary insights obtained with the two techniques provided a unique basis for understanding the function of Ufc1 at atomic resolution. The Ufc1 structure consists of the catalytic core domain conserved in all E2-like enzymes and an additional N-terminal helix. The active site Cys116, which forms a thio-ester bond with Ufm1, is located in a flexible loop that is highly solvent accessible. Based on the Ufc1 and Ufm1 NMR structures, a model could be derived for the Ufc1-Ufm1 complex in which the C-terminal Gly83 of Ufm1 may well form the expected thio-ester with Cys116, suggesting that Ufm1-Ufc1 functions as described for other E1-E2-E3 machineries. α-helix 1 of Ufc1 adopts different conformations in the crystal and in solution, suggesting that this helix plays a key role in mediating specificity.
Ufc1; Ufm1; Ubiquitin; E2; Ubiquitin Conjugating Enzyme
Crystallization has proven to be the most significant bottleneck to high-throughput protein structure determination using diffraction methods. We have used the large-scale, systematically generated experimental results of the Northeast Structural Genomics Consortium to characterize the biophysical properties that control protein crystallization. Datamining of crystallization results combined with explicit folding studies leads to the conclusion that crystallization propensity is controlled primarily by the prevalence of well-ordered surface epitopes capable of mediating interprotein interactions and is not strongly influenced by overall thermodynamic stability. These analyses identify specific sequence features correlating with crystallization propensity that can be used to estimate the crystallization probability of a given construct. Analyses of entire predicted proteomes demonstrate substantial differences in the bulk amino acid sequence properties of human versus eubacterial proteins that reflect likely differences in their biophysical properties including crystallization propensity. Finally, our thermodynamic measurements enable critical evaluation of previous claims regarding correlations between protein stability and bulk sequence properties, which generally are not supported by our dataset.
protein crystallization; protein thermodynamics; crystallization mechanism; surface entropy; datamining; structural genomics
Protein ubiquitination provides an efficient and reversible mechanism to regulate cell cycle progression and checkpoint control. Numerous regulatory proteins direct the addition of ubiquitin to lysine residues on target proteins, and these are countered by an army of deubiquitinating enzymes (DUBs). BRCA1-associated protein-1 (Bap1) is a ubiquitin carboxy-terminal hydrolase and is frequently mutated in lung and sporadic breast tumors. Bap1 can suppress growth of lung cancer cells in athymic nude mice and this requires its DUB activity. We show here that Bap1 interacts with host cell factor 1 (HCF-1), a transcriptional cofactor found in a number of important regulatory complexes. Bap1 binds to the HCF-1 β-propeller using a variant of the HCF-binding motif found in herpes simplex virus VP16 and other HCF-interacting proteins. HCF-1 is K48 and K63 ubiquitinated, with a major site of linkage at lysines 1807 and 1808 in the HCF-1C subunit. Expression of a catalytically inactive version of Bap1 results in the selective accumulation of K48 ubiquitinated polypeptides. Depletion of Bap1 using small interfering RNA results in a modest accumulation of HCF-1C, suggesting that Bap1 helps to control cell proliferation by regulating HCF-1 protein levels and by associating with genes involved in the G1-S transition.
The Protein Structure Initiative (PSI) at the US National Institutes of Health (NIH) is funding four large-scale centers for structural genomics (SG). These centers systematically target many large families without structural coverage, as well as very large families with inadequate structural coverage. Here, we report a few simple metrics that demonstrate how successfully these efforts optimize structural coverage: while the PSI-2 (2005-now) contributed more than 8% of all structures deposited into the PDB, it contributed over 20% of all novel structures (i.e. structures for protein sequences with no structural representative in the PDB on the date of deposition). The structural coverage of the protein universe represented by today’s UniProt (v12.8) has increased linearly from 1992 to 2008; structural genomics has contributed significantly to the maintenance of this growth rate. Success in increasing novel leverage (defined in Liu et al. in Nat Biotechnol 25:849–851, 2007) has resulted from systematic targeting of large families. PSI’s per structure contribution to novel leverage was over 4-fold higher than that for non-PSI structural biology efforts during the past 8 years. If the success of the PSI continues, it may take just another ~15 years to cover most sequences in the current UniProt database.
Protein structure determination; Structural genomics; Evolution; Protein universe
A large-scale survey using single nucleotide polymorphism data from dbSNP provides insights into the evolutionary selection constraints on human proteins of different structural and functional categories.
The rates of molecular evolution for protein-coding genes depend on the stringency of functional or structural constraints. The Ka/Ks ratio has been commonly used as an indicator of selective constraints and is typically calculated from interspecies alignments. Recent accumulation of single nucleotide polymorphism (SNP) data has enabled the derivation of Ka/Ks ratios for polymorphism (SNP A/S ratios).
Using data from the dbSNP database, we conducted the first large-scale survey of SNP A/S ratios for different structural and functional properties. We confirmed that the SNP A/S ratio is largely correlated with Ka/Ks for divergence. We observed stronger selective constraints for proteins that have high mRNA expression levels or broad expression patterns, have no paralogs, arose earlier in evolution, have natively disordered regions, are located in cytoplasm and nucleus, or are related to human diseases. On the residue level, we found higher degrees of variation for residues that are exposed to solvent, are in a loop conformation, natively disordered regions or low complexity regions, or are in the signal peptides of secreted proteins. Our analysis also revealed that histones and protein kinases are among the protein families that are under the strongest selective constraints, whereas olfactory and taste receptors are among the most variable groups.
Our study suggests that the SNP A/S ratio is a robust measure for selective constraints. The correlations between SNP A/S ratios and other variables provide valuable insights into the natural selection of various structural or functional properties, particularly for human-specific genes and constraints within the human lineage.
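As a rough illustration of the quantity discussed above, the following Python sketch computes a ratio of amino-acid-altering (A) to synonymous (S) SNPs for a gene or protein category. The abstract does not spell out the exact formula or any normalization, so both the raw count ratio and the optional per-site normalization here are assumptions; `snp_as_ratio` and its parameters are hypothetical names.

```python
from typing import Optional


def snp_as_ratio(n_altering: int, n_synonymous: int,
                 altering_sites: Optional[float] = None,
                 synonymous_sites: Optional[float] = None) -> float:
    """Ratio of amino-acid-altering to synonymous SNPs (assumed definition).

    If site counts are supplied, each SNP count is first divided by the number
    of sites at which that kind of change could occur, by analogy with Ka/Ks.
    """
    if n_synonymous <= 0:
        raise ValueError("at least one synonymous SNP is needed to form the ratio")
    if altering_sites is None and synonymous_sites is None:
        return n_altering / n_synonymous
    if not altering_sites or not synonymous_sites:
        raise ValueError("provide both site counts (positive) or neither")
    return (n_altering / altering_sites) / (n_synonymous / synonymous_sites)


# Made-up counts for illustration only.
raw_ratio = snp_as_ratio(30, 60)
per_site_ratio = snp_as_ratio(30, 60, altering_sites=300, synonymous_sites=100)
print(raw_ratio, per_site_ratio)
```

Under this reading, a lower ratio for a protein category would correspond to stronger selective constraint against amino acid changes.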
We survey computational approaches that tackle membrane protein structure and function prediction. While describing the main ideas that have led to the development of the most relevant and novel methods, we also discuss pitfalls, provide practical hints and highlight the challenges that remain. The methods covered include: sequence alignment, motif search, functional residue identification, transmembrane segment and protein topology predictions, homology and ab initio modeling. Overall, predictions of functional and structural features of membrane proteins are improving, although progress is hampered by the limited amount of high-resolution experimental information available. While predictions of transmembrane segments and protein topology rank among the most accurate methods in computational biology, more attention and effort will be required in the future to ameliorate database search, homology and ab initio modeling.
membrane proteins; protein structure prediction; protein function prediction; alignments; transmembrane segment prediction; homology modeling; ab initio modeling
Natively unstructured or disordered protein regions may increase the functional complexity of an organism; they are particularly abundant in eukaryotes and often evade structure determination. Many computational methods predict unstructured regions by training on outliers in otherwise well-ordered structures. Here, we introduce an approach that uses a neural network in a very different and novel way. We hypothesize that very long contiguous segments with nonregular secondary structure (NORS regions) differ significantly from regular, well-structured loops, and that a method detecting such features could predict natively unstructured regions. Training our new method, NORSnet, on predicted information rather than on experimental data yielded three major advantages: it removed the overlap between testing and training, it systematically covered entire proteomes, and it explicitly focused on one particular aspect of unstructured regions with a simple structural interpretation, namely that they are loops. Our hypothesis was correct: well-structured and unstructured loops differ so substantially that NORSnet succeeded in their distinction. Benchmarks on previously used and new experimental data of unstructured regions revealed that NORSnet performed very well. Although it was not the best single prediction method, NORSnet was sufficiently accurate to flag unstructured regions in proteins that were previously not annotated. In one application, NORSnet revealed previously undetected unstructured regions in putative targets for structural genomics and may thereby contribute to increasing structural coverage of large eukaryotic families. NORSnet found unstructured regions more often in domain boundaries than expected at random. In another application, we estimated that 50%–70% of all worm proteins observed to have more than seven protein–protein interaction partners have unstructured regions. The comparative analysis between NORSnet and DISOPRED2 suggested that long unstructured loops are a major part of unstructured regions in molecular networks.
The details of protein structures are important for function. Regions that do not adopt any regular structure in isolation (natively unstructured or disordered regions) initially appeared as a curious exception to this structure–function paradigm. It has become increasingly clear that unstructured regions are fundamental to many roles and that they are particularly important for multicellular organisms. Structural biology is just beginning to apprehend the stunning diversity of these roles. Here, we focused on unstructured regions dominated by a particular type of loop, namely the natively unstructured one. We developed a method that succeeded in the distinction between well-structured and natively unstructured loops. For the development, we did not use any experimental data for unstructured regions; when tested on experimental data, the method performed surprisingly well. Due to its different premises, the method captured very different aspects of unstructured regions than other methods that we tested. We applied the new method to two different problems. The first was the identification of proteins that may be difficult targets for structure determination. The second was the identification of worm proteins that have many interaction partners (more than seven) and unstructured regions. Surprisingly, we found unstructured regions of the loopy type in more than 50% of all the promiscuous worm proteins.
RIKEN's FANTOM project has revealed many previously unknown coding sequences, as well as an unexpected degree of variation in transcripts resulting from alternative promoter usage and splicing. Ever more transcripts that do not code for proteins have been identified by transcriptome studies, in general. Increasing evidence points to the important cellular roles of such non-coding RNAs (ncRNAs). The distinction of protein-coding RNA transcripts from ncRNA transcripts is therefore an important problem in understanding the transcriptome and carrying out its annotation. Very few in silico methods have specifically addressed this problem. Here, we introduce CONC (for “coding or non-coding”), a novel method based on support vector machines that classifies transcripts according to features they would have if they were coding for proteins. These features include peptide length, amino acid composition, predicted secondary structure content, predicted percentage of exposed residues, compositional entropy, number of homologs from database searches, and alignment entropy. Nucleotide frequencies are also incorporated into the method. Confirmed coding cDNAs for eukaryotic proteins from the Swiss-Prot database constituted the set of true positives, ncRNAs from RNAdb and NONCODE the true negatives. Ten-fold cross-validation suggested that CONC distinguished coding RNAs from ncRNAs at about 97% specificity and 98% sensitivity. Applied to 102,801 mouse cDNAs from the FANTOM3 dataset, our method reliably identified over 14,000 ncRNAs and estimated the total number of ncRNAs to be about 28,000.
There are two types of RNA: messenger RNAs (mRNAs), which are translated into proteins, and non-coding RNAs (ncRNAs), which function as RNA molecules. Besides textbook examples such as tRNAs and rRNAs, non-coding RNAs have been found to carry out very diverse functions, from mRNA splicing and RNA modification to translational regulation. It has been estimated that non-coding RNAs make up the vast majority of transcription output of higher eukaryotes. Discriminating mRNA from ncRNA has become an important biological and computational problem. The authors describe a computational method based on a machine learning algorithm known as a support vector machine (SVM) that classifies transcripts according to features they would have if they were coding for proteins. These features include peptide length, amino acid composition, secondary structure content, and protein alignment information. The method is applied to the dataset from the FANTOM3 large-scale mouse cDNA sequencing project; it identifies over 14,000 ncRNAs in mouse and estimates the total number of ncRNAs in the FANTOM3 data to be about 28,000.
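Three of the sequence-derived features named in the abstract above (peptide length, amino acid composition, and compositional entropy) are simple enough to sketch in a few lines of Python. This is only an illustrative guess at how such features might be computed from a translated peptide; it is not CONC's actual feature extraction, and `peptide_features` is a hypothetical name. Entropy is taken here to be the Shannon entropy (in bits) of the residue frequencies, which is an assumption.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def peptide_features(peptide: str) -> dict:
    """Compute length, amino acid composition, and Shannon entropy of a peptide."""
    peptide = peptide.upper()
    if not peptide:
        raise ValueError("peptide must not be empty")
    n = len(peptide)
    counts = Counter(peptide)
    # Fraction of the peptide made up by each standard amino acid.
    composition = {aa: counts.get(aa, 0) / n for aa in AMINO_ACIDS}
    # Shannon entropy over all observed symbols (including any non-standard ones).
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {"length": n, "composition": composition, "entropy": entropy}


features = peptide_features("ACAC")
print(features["length"], features["composition"]["A"], features["entropy"])
```

In a classifier of the kind described, a vector of such values, together with the other listed features, would form the input to the support vector machine.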
Predicting the boundaries of structural domains has been an important and challenging problem in experimental and computational structural biology. Predictions have been based on intuition, biochemical properties, statistics, sequence homology and other aspects of predicted protein structure. Here, we introduced CHOPnet, a de novo method that predicts structural domains in the absence of homology to known domains. Our method was based on neural networks and relied exclusively on information available for all proteins. Evaluating sustained performance through rigorous cross-validation on proteins of known structure, we correctly predicted the number of domains in 69% of all proteins. For 50% of the two-domain proteins the centre of the predicted boundary was closer than 20 residues to the boundary assigned from three-dimensional (3D) structures; this was about eight percentage points better than predictions by ‘equal split’. Our results appeared to compare favourably with those from previously published methods. CHOPnet may be useful to restrict the experimental testing of different fragments for structure determination in the context of structural genomics.
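The boundary criterion and the ‘equal split’ baseline mentioned above can be sketched as follows. This is an illustrative reading, assuming that ‘equal split’ places the boundary of a two-domain protein at the midpoint of the sequence and that "closer than 20 residues" is a strict inequality; the function names are hypothetical.

```python
def equal_split_boundary(length):
    """Baseline: place the single domain boundary at the sequence midpoint."""
    return length // 2

def within_tolerance(predicted, observed, tolerance=20):
    """True if the predicted boundary is closer than `tolerance` residues."""
    return abs(predicted - observed) < tolerance

def fraction_correct(proteins, predict=equal_split_boundary, tolerance=20):
    """proteins: list of (sequence_length, observed_boundary) pairs."""
    if not proteins:
        return 0.0
    hits = sum(
        within_tolerance(predict(length), observed, tolerance)
        for length, observed in proteins
    )
    return hits / len(proteins)
```

Passing a different `predict` function allows any predictor to be scored on the same proteins as the baseline.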
Sequence-based domain assignment is one of the most important and challenging problems in structural biology. We have developed a method, CHOP, that chops proteins into domain-like fragments. The basic idea is to cut proteins from entirely sequenced organisms beginning from very reliable experimental information (Protein Data Bank), proceeding to expert annotations of domain-like regions (Pfam-A) and completing through cuts based on termini of native protein ends. The CHOP server takes protein sequences as input and returns the dissections supported by homology transfer. CHOP results are precompiled for many entirely sequenced proteomes. The service is available at http://www.rostlab.org/services/CHOP/.
PredictProtein (http://www.predictprotein.org) is an Internet service for sequence analysis and the prediction of protein structure and function. Users submit protein sequences or alignments; PredictProtein returns multiple sequence alignments, PROSITE sequence motifs, low-complexity regions (SEG), nuclear localization signals, regions lacking regular structure (NORS) and predictions of secondary structure, solvent accessibility, globular regions, transmembrane helices, coiled-coil regions, structural switch regions, disulfide-bonds, sub-cellular localization and functional annotations. Upon request fold recognition by prediction-based threading, CHOP domain assignments, predictions of transmembrane strands and inter-residue contacts are also available. For all services, users can submit their query either by electronic mail or interactively via the World Wide Web.
Very few methods address the problem of predicting beta-barrel membrane proteins directly from sequence. One reason is that only very few high-resolution structures for transmembrane beta-barrel (TMB) proteins have been determined thus far. Here we introduced the design, statistics and results of a novel profile-based hidden Markov model for the prediction and discrimination of TMBs. The method carefully attempts to avoid over-fitting the sparse experimental data. While our model training and scoring procedures were very similar to a recently published work, the architecture and structure-based labelling were significantly different. In particular, we introduced a new definition of beta-hairpin motifs, explicit state modelling of transmembrane strands, and a log-odds whole-protein discrimination score. The resulting method reached an overall four-state (up-, down-strand, periplasmic-, outer-loop) accuracy as high as 86%. Furthermore, the method accurately discriminated TMB from non-TMB proteins (45% coverage at 100% accuracy). This high precision enabled the application to 72 entirely sequenced Gram-negative bacteria. We found over 164 previously uncharacterized TMB proteins at high confidence. Database searches did not implicate any of these proteins with membranes. We expect that the vast majority of our 164 predictions will eventually be verified experimentally. All proteome predictions and the PROFtmb prediction method are available at http://www.rostlab.org/services/PROFtmb/.
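A figure such as "45% coverage at 100% accuracy" corresponds to choosing a threshold on the whole-protein discrimination score. The Python sketch below shows one way such a pair of numbers could be computed, assuming that accuracy means the fraction of predicted TMBs that are true TMBs and coverage means the fraction of all true TMBs that are predicted. These definitions, the function name and the example scores are assumptions, not taken from PROFtmb.

```python
def coverage_and_accuracy(scored, threshold):
    """scored: list of (score, is_tmb) pairs; predict TMB when score >= threshold."""
    predicted = [is_tmb for score, is_tmb in scored if score >= threshold]
    total_tmb = sum(1 for _, is_tmb in scored if is_tmb)
    true_positives = sum(predicted)
    coverage = true_positives / total_tmb if total_tmb else 0.0
    accuracy = true_positives / len(predicted) if predicted else 0.0
    return coverage, accuracy

# Made-up scores for illustration only.
example = [(5.0, True), (3.0, True), (2.5, False), (1.0, True), (0.5, False)]
strict = coverage_and_accuracy(example, threshold=3.0)   # fewer, safer predictions
lenient = coverage_and_accuracy(example, threshold=0.0)  # every protein predicted
```

Raising the threshold trades coverage for accuracy, which is why a high-confidence proteome scan reports only part of the expected TMBs.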
Many structurally flexible regions play important roles in biological processes. It has been shown that extended loopy regions are very abundant in the protein universe and that they have been conserved through evolution. Here, we present NORSp, a publicly available predictor for disordered regions in proteins. Specifically, NORSp predicts long regions with NO Regular Secondary structure. Upon user submission of a protein sequence, NORSp analyses the protein for its secondary structure and for the presence of transmembrane helices and coiled-coil regions. It then emails the user the presence and position of any disordered regions. NORSp can be accessed from http://cubic.bioc.columbia.edu/services/NORSp/.
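The idea of a long region with no regular secondary structure can be sketched as a sliding-window scan over a predicted secondary structure string. The sketch below is illustrative only: the `H`/`E` encoding for helix and strand, the window length of 70 residues and the 12% content cutoff are assumed defaults not stated in the text, and the transmembrane and coiled-coil checks are omitted.

```python
def find_nors(sec_struct, min_len=70, max_content=0.12):
    """Return merged half-open (start, end) intervals in which every window of
    `min_len` residues has a helix/strand fraction below `max_content`."""
    n = len(sec_struct)
    regular = [1 if s in "HE" else 0 for s in sec_struct]
    regions = []
    if n < min_len:
        return regions
    count = sum(regular[:min_len])  # helix/strand residues in the first window
    for start in range(n - min_len + 1):
        if start > 0:
            # Slide the window by one residue.
            count += regular[start + min_len - 1] - regular[start - 1]
        if count / min_len < max_content:
            end = start + min_len
            if regions and start <= regions[-1][1]:
                regions[-1] = (regions[-1][0], end)  # extend the previous region
            else:
                regions.append((start, end))
    return regions
```

For example, `find_nors("H" * 30 + "L" * 100 + "E" * 30)` reports a single region that covers the loop stretch plus a few flanking residues tolerated by the content cutoff.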