1.  AlmostSignificant: simplifying quality control of high-throughput sequencing data 
Bioinformatics  2016;32(24):3850-3851.
Motivation: The current generation of DNA sequencing technologies produces a large amount of data quickly. All of these data need to pass some form of quality control (QC) processing and checking before they can be used for any analysis. The large number of samples that are run through Illumina sequencing machines makes the process of QC an onerous and time-consuming task that requires multiple pieces of information from several sources.
Results: AlmostSignificant is an open-source platform for aggregating multiple sources of quality metrics as well as run and sample meta-data associated with DNA sequencing runs from Illumina sequencing machines. AlmostSignificant is a graphical platform to streamline the QC of DNA sequencing data and to store these data for future reference, together with extra meta-data associated with the sequencing runs that are not typically retained. This simplifies the challenge of monitoring the volume of data produced by Illumina sequencers. AlmostSignificant has been used to track the quality of over 80 sequencing runs covering over 2500 samples produced over the last three years.
Availability and Implementation: The code and documentation for AlmostSignificant is freely available at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC5167069  PMID: 27559158
2.  Filaggrin-stratified transcriptomic analysis of pediatric skin identifies mechanistic pathways in patients with atopic dermatitis 
Atopic dermatitis (AD; eczema) is characterized by a widespread abnormality in cutaneous barrier function and propensity to inflammation. Filaggrin is a multifunctional protein and plays a key role in skin barrier formation. Loss-of-function mutations in the gene encoding filaggrin (FLG) are a highly significant risk factor for atopic disease, but the molecular mechanisms leading to dermatitis remain unclear.
We sought to interrogate tissue-specific variations in the expressed genome in the skin of children with AD and to investigate underlying pathomechanisms in atopic skin.
We applied single-molecule direct RNA sequencing to analyze the whole transcriptome using minimal tissue samples. Uninvolved skin biopsy specimens from 26 pediatric patients with AD were compared with site-matched samples from 10 nonatopic teenage control subjects. Cases and control subjects were screened for FLG genotype to stratify the data set.
Two thousand four hundred thirty differentially expressed genes (false discovery rate, P < .05) were identified, of which 211 were significantly upregulated and 490 downregulated by greater than 2-fold. Gene ontology terms for “extracellular space” and “defense response” were enriched, whereas “lipid metabolic processes” were downregulated. The subset of FLG wild-type cases showed dysregulation of genes involved with lipid metabolism, whereas filaggrin haploinsufficiency affected global gene expression and was characterized by a type 1 interferon–mediated stress response.
These analyses demonstrate the importance of extracellular space and lipid metabolism in atopic skin pathology independent of FLG genotype, whereas an aberrant defense response is seen in subjects with FLG mutations. Genotype stratification of the large data set has facilitated functional interpretation and might guide future therapy development.
PMCID: PMC4090750  PMID: 24880632
Atopic dermatitis; direct RNA sequencing; eczema; filaggrin; gene expression; single molecule; skin; tissue; transcriptome; AD, Atopic dermatitis; CILP, Cartilage intermediate layer protein gene; DRS, Direct RNA sequencing; eQTL, Expression quantitative trait loci; FDR, False discovery rate; FLG, Filaggrin gene; GO, Gene ontology; STAT, Signal transducer and activator of transcription
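The selection criteria quoted in this abstract (false discovery rate P < .05 and a change of greater than 2-fold) can be illustrated with a minimal Python sketch. It assumes the Benjamini-Hochberg procedure for the FDR adjustment, which the abstract does not specify; the function names, gene labels and values are hypothetical and are not taken from the study.

```python
import math

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def select_genes(genes, pvalues, fold_changes, alpha=0.05, min_fold=2.0):
    """Keep genes with adjusted p < alpha and more than min_fold change in either direction."""
    adjusted = benjamini_hochberg(pvalues)
    threshold = math.log2(min_fold)
    selected = []
    for gene, q, fc in zip(genes, adjusted, fold_changes):
        if q < alpha and abs(math.log2(fc)) > threshold:
            selected.append(gene)
    return selected

# Hypothetical toy input: fold change is the case/control expression ratio.
genes = ["geneA", "geneB", "geneC", "geneD"]
pvalues = [0.01, 0.04, 0.03, 0.005]
fold_changes = [3.0, 1.5, 0.25, 0.9]
hits = select_genes(genes, pvalues, fold_changes)
```

Taking the absolute log2 ratio treats a 4-fold decrease (ratio 0.25) the same way as a 4-fold increase, which matches the abstract's separate counts of upregulated and downregulated genes.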
3.  Elevated O‐GlcNAc Levels Activate Epigenetically Repressed Genes and Delay Mouse ESC Differentiation Without Affecting Naïve to Primed Cell Transition 
Stem Cells (Dayton, Ohio)  2014;32(10):2605-2615.
The differentiation of mouse embryonic stem cells (ESCs) is controlled by the interaction of multiple signaling pathways, typically mediated by post‐translational protein modifications. The addition of O‐linked N‐acetylglucosamine (O‐GlcNAc) to serine and threonine residues of nuclear and cytoplasmic proteins is one such modification (O‐GlcNAcylation), whose function in ESCs is only now beginning to be elucidated. Here, we demonstrate that the specific inhibition of O‐GlcNAc hydrolase (Oga) causes increased levels of protein O‐GlcNAcylation and impairs differentiation of mouse ESCs both in serum‐free monolayer and in embryoid bodies (EBs). Use of reporter cell lines demonstrates that Oga inhibition leads to a reduction in the number of Sox1‐expressing neural progenitors generated following induction of neural differentiation as well as maintained expression of the ESC marker Oct4 (Pou5f1). In EBs, expression of mesodermal and endodermal markers is also delayed. However, the transition of naïve cells to primed pluripotency indicated by Rex1 (Zfp42), Nanog, Esrrb, and Dppa3 downregulation and Fgf5 upregulation remains unchanged. Finally, we demonstrate that increased O‐GlcNAcylation results in upregulation of genes normally epigenetically silenced in ESCs, supporting the emerging role for this protein modification in the regulation of histone modifications and DNA methylation.
PMCID: PMC4737245  PMID: 24898611
Embryonic stem cells; Cell differentiation; O‐GlcNAc; Post‐translational protein modification; Signal transduction; Oligonucleotide microarrays
4.  Direct Sequencing of Arabidopsis thaliana RNA Reveals Patterns of Cleavage and Polyadenylation 
It has recently been shown that RNA 3′ end formation plays a more widespread role in controlling gene expression than previously thought. In order to examine the impact of regulated 3′ end formation genome-wide we applied direct RNA sequencing to A. thaliana. Here we show the authentic transcriptome in unprecedented detail and how 3′ end formation impacts genome organization. We reveal extreme heterogeneity in RNA 3′ ends, discover previously unrecognized non-coding RNAs and propose widespread re-annotation of the genome. We explain the origin of most poly(A)+ antisense RNAs and identify cis-elements that control 3′ end formation in different registers. These findings are essential to understand what the genome actually encodes, how it is organized and the impact of regulated 3′ end formation on these processes.
PMCID: PMC3533403  PMID: 22820990
5.  Statistical models for RNA-seq data derived from a two-condition 48-replicate experiment 
Bioinformatics  2015;31(22):3625-3630.
Motivation: High-throughput RNA sequencing (RNA-seq) is now the standard method to determine differential gene expression. Identifying differentially expressed genes crucially depends on estimates of read-count variability. These estimates are typically based on statistical models such as the negative binomial distribution, which is employed by the tools edgeR, DESeq and cuffdiff. Until now, the validity of these models has usually been tested on either low-replicate RNA-seq data or simulations.
Results: A 48-replicate RNA-seq experiment in yeast was performed and data tested against theoretical models. The observed gene read counts were consistent with both log-normal and negative binomial distributions, while the mean-variance relation followed the line of constant dispersion parameter of ∼0.01. The high-replicate data also allowed for strict quality control and screening of ‘bad’ replicates, which can drastically affect the gene read-count distribution.
Availability and implementation: RNA-seq data have been submitted to ENA archive with project ID PRJEB5348.
PMCID: PMC4754627  PMID: 26206307
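The mean-variance relation mentioned in this abstract can be made concrete with a short Python sketch. It uses the standard negative binomial parameterisation, variance = mean + dispersion × mean², with the dispersion of about 0.01 quoted above; the function names are hypothetical and the sketch is not the authors' analysis code.

```python
def nb_variance(mean, dispersion=0.01):
    """Variance of a negative binomial read count with the given mean and dispersion."""
    # The first term is Poisson (shot) noise; the second is extra-Poisson variability.
    return mean + dispersion * mean ** 2

def nb_cv(mean, dispersion=0.01):
    """Coefficient of variation (standard deviation divided by mean)."""
    return nb_variance(mean, dispersion) ** 0.5 / mean

# For highly expressed genes the coefficient of variation approaches
# sqrt(dispersion), i.e. roughly 10% between-replicate variability.
for mean_count in (10, 100, 1000, 100000):
    print(mean_count, round(nb_variance(mean_count), 1), round(nb_cv(mean_count), 3))
```

Under this relation, Poisson noise dominates at low counts, while at high counts the relative variability is set almost entirely by the dispersion parameter.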
6.  JPred4: a protein secondary structure prediction server 
Nucleic Acids Research  2015;43(Web Server issue):W389-W394.
JPred4 is the latest version of the popular JPred protein secondary structure prediction server which provides predictions by the JNet algorithm, one of the most accurate methods for secondary structure prediction. In addition to protein secondary structure, JPred also makes predictions of solvent accessibility and coiled-coil regions. The JPred service runs up to 94 000 jobs per month and has carried out over 1.5 million predictions in total for users in 179 countries. The JPred4 web server has been re-implemented in the Bootstrap framework and JavaScript to improve its design, usability and accessibility from mobile devices. JPred4 features higher accuracy, with a blind three-state (α-helix, β-strand and coil) secondary structure prediction accuracy of 82.0% while solvent accessibility prediction accuracy has been raised to 90% for residues <5% accessible. Reporting of results is enhanced both on the website and through the optional email summaries and batch submission results. Predictions are now presented in SVG format with options to view full multiple sequence alignments with and without gaps and insertions. Finally, the help-pages have been updated and tool-tips added as well as step-by-step tutorials.
PMCID: PMC4489285  PMID: 25883141
7.  14-3-3-Pred: improved methods to predict 14-3-3-binding phosphopeptides 
Bioinformatics  2015;31(14):2276-2283.
Motivation: The 14-3-3 family of phosphoprotein-binding proteins regulates many cellular processes by docking onto pairs of phosphorylated Ser and Thr residues in a constellation of intracellular targets. Therefore, there is a pressing need to develop new prediction methods that use an updated set of 14-3-3-binding motifs for the identification of new 14-3-3 targets and to prioritize the downstream analysis of >2000 potential interactors identified in high-throughput experiments.
Results: Here, a comprehensive set of 14-3-3-binding targets from the literature was used to develop 14-3-3-binding phosphosite predictors. Position-specific scoring matrix, support vector machines (SVM) and artificial neural network (ANN) classification methods were trained to discriminate experimentally determined 14-3-3-binding motifs from non-binding phosphopeptides. ANN, position-specific scoring matrix and SVM methods showed best performance for a motif window spanning from −6 to +4 around the binding phosphosite, achieving Matthews correlation coefficient of up to 0.60. Blind prediction showed that all three methods outperform two popular 14-3-3-binding site predictors, Scansite and ELM. The new methods were used for prediction of 14-3-3-binding phosphosites in the human proteome. Experimental analysis of high-scoring predictions in the FAM122A and FAM122B proteins confirms the predictions and suggests the new 14-3-3-predictors will be generally useful.
Availability and implementation: A standalone prediction web server is available. Human candidate 14-3-3-binding phosphosites were integrated in ANIA: ANnotation and Integrated Analysis of the 14-3-3 interactome database.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC4495292  PMID: 25735772
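Two quantities named in this abstract, the motif window from −6 to +4 around the phosphosite and the Matthews correlation coefficient, can be sketched in a few lines of Python. The function names, the padding character and the example sequence are hypothetical; this is an illustration of the definitions, not the 14-3-3-Pred implementation.

```python
import math

def motif_window(seq, site, left=6, right=4, pad="X"):
    """Return the residues from site-left to site+right (0-based site index).

    Positions that fall outside the sequence are filled with `pad`, so the
    window always has left + right + 1 characters (11 for -6..+4).
    """
    chars = []
    for i in range(site - left, site + right + 1):
        chars.append(seq[i] if 0 <= i < len(seq) else pad)
    return "".join(chars)

def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Convention: report 0 when any marginal total is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom

# Hypothetical peptide with the phosphosite at 0-based index 8.
window = motif_window("ABCDEFGHIJKLMNOP", 8)
```

A fixed-length window with padding lets sites near a protein terminus be scored by the same position-specific matrix or classifier as internal sites.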
8.  Major transcriptome re-organisation and abrupt changes in signalling, cell cycle and chromatin regulation at neural differentiation in vivo 
Development (Cambridge, England)  2014;141(16):3266-3276.
Here, we exploit the spatial separation of temporal events of neural differentiation in the elongating chick body axis to provide the first analysis of transcriptome change in progressively more differentiated neural cell populations in vivo. Microarray data, validated against direct RNA sequencing, identified: (1) a gene cohort characteristic of the multi-potent stem zone epiblast, which contains neuro-mesodermal progenitors that progressively generate the spinal cord; (2) a major transcriptome re-organisation as cells then adopt a neural fate; and (3) increasing diversity as neural patterning and neuron production begin. Focussing on the transition from multi-potent to neural state cells, we capture changes in major signalling pathways, uncover novel Wnt and Notch signalling dynamics, and implicate new pathways (mevalonate pathway/steroid biogenesis and TGFβ). This analysis further predicts changes in cellular processes, cell cycle, RNA-processing and protein turnover as cells acquire neural fate. We show that these changes are conserved across species and provide biological evidence for reduced proteasome efficiency and a novel lengthening of S phase. This latter step may provide time for epigenetic events to mediate large-scale transcriptome re-organisation; consistent with this, we uncover simultaneous downregulation of major chromatin modifiers as the neural programme is established. We further demonstrate that transcription of one such gene, HDAC1, is dependent on FGF signalling, making a novel link between signals that control neural differentiation and transcription of a core regulator of chromatin organisation. Our work implicates new signalling pathways and dynamics, cellular processes and epigenetic modifiers in neural differentiation in vivo, identifying multiple new potential cellular and molecular mechanisms that direct differentiation.
PMCID: PMC4197544  PMID: 25063452
Neural differentiation; Transcriptome; Cell cycle; FGF signalling; Chromatin; Chick embryo
9.  Improved Annotation of 3′ Untranslated Regions and Complex Loci by Combination of Strand-Specific Direct RNA Sequencing, RNA-Seq and ESTs 
PLoS ONE  2014;9(4):e94270.
The reference annotations made for a genome sequence provide the framework for all subsequent analyses of the genome. Correct and complete annotation in addition to the underlying genomic sequence is particularly important when interpreting the results of RNA-seq experiments where short sequence reads are mapped against the genome and assigned to genes according to the annotation. Inconsistencies in annotations between the reference and the experimental system can lead to incorrect interpretation of the effect on RNA expression of an experimental treatment or mutation in the system under study. Until recently, the genome-wide annotation of 3′ untranslated regions received less attention than coding regions and the delineation of intron/exon boundaries. In this paper, data produced for samples in Human, Chicken and A. thaliana by the novel single-molecule, strand-specific, Direct RNA Sequencing technology from Helicos Biosciences, which locates 3′ polyadenylation sites to within ±2 nt, were combined with archival EST and RNA-Seq data. Nine examples are illustrated where this combination of data allowed: (1) gene and 3′ UTR re-annotation (including extension of one 3′ UTR by 5.9 kb); (2) disentangling of gene expression in complex regions; (3) clearer interpretation of small RNA expression and (4) identification of novel genes. While the specific examples displayed here may become obsolete as genome sequences and their annotations are refined, the principles laid out in this paper will be of general use both to those annotating genomes and those seeking to interpret existing publicly available annotations in the context of their own experimental data.
PMCID: PMC3983147  PMID: 24722185
10.  Haploinsufficiency for AAGAB causes clinically heterogeneous forms of punctate palmoplantar keratoderma 
Nature genetics  2012;44(11):10.1038/ng.2444.
Palmoplantar keratodermas (PPKs) are a group of disorders that are diagnostically and therapeutically problematic in dermatogenetics1-3. Punctate PPKs are characterized by circumscribed hyperkeratotic lesions on palms and soles with considerable heterogeneity. In 18 families with autosomal dominant punctate PPK (OMIM #148600), we report heterozygous loss-of-function mutations in AAGAB, encoding alpha- and gamma-adaptin binding protein p34, at a previously linked locus on 15q22. p34, a cytosolic protein with a Rab-like GTPase domain, was shown to bind both clathrin adaptor protein complexes, indicative of a role in membrane traffic. Ultrastructurally, lesional epidermis showed abnormalities in intracellular vesicle biology. Immunohistochemistry showed hyperproliferation within the punctate lesions. Knockdown of p34 in keratinocytes led to increased cell division, which was linked to greatly increased epidermal growth factor receptor (EGFR) protein expression and tyrosine phosphorylation. We hypothesize that p34 deficiency may impair endocytic recycling of growth factor receptors such as EGFR, leading to increased signaling and proliferation.
PMCID: PMC3836166  PMID: 23064416
11.  Tmem79/Matt is the matted mouse gene and is a predisposing gene for atopic dermatitis in human subjects 
Atopic dermatitis (AD) is a major inflammatory condition of the skin caused by inherited skin barrier deficiency, with mutations in the filaggrin gene predisposing to development of AD. Support for barrier deficiency initiating AD came from flaky tail mice, which have a frameshift mutation in Flg and also carry an unknown gene, matted, causing a matted hair phenotype.
We sought to identify the matted mutant gene in mice and further define whether mutations in the human gene were associated with AD.
A mouse genetics approach was used to separate the matted and Flg mutations to produce congenic single-mutant strains for genetic and immunologic analysis. Next-generation sequencing was used to identify the matted gene. Five independently recruited AD case collections were analyzed to define associations between single nucleotide polymorphisms (SNPs) in the human gene and AD.
The matted phenotype in flaky tail mice is due to a mutation in the Tmem79/Matt gene, with no expression of the encoded protein mattrin in the skin of mutant mice. Matt(ft) mice spontaneously have dermatitis and atopy caused by a defective skin barrier, with mutant mice having systemic sensitization after cutaneous challenge with house dust mite allergens. Meta-analysis of 4,245 AD cases and 10,558 population-matched control subjects showed that a missense SNP, rs6694514, in the human MATT gene has a small but significant association with AD.
In mice, mutations in Matt cause a defective skin barrier and spontaneous dermatitis and atopy. A common SNP in MATT has an association with AD in human subjects.
PMCID: PMC3834151  PMID: 24084074
Allergy; association; atopic dermatitis; atopy; eczema; filaggrin; flaky tail; Matt; mattrin; mouse; mutation; Tmem79; AD, Atopic dermatitis; DM, Double mutant; FLG, Filaggrin; HDM, House dust mite; hpf, High-power field; MAPEG, Membrane-associated proteins in eicosanoid and glutathione metabolism; OR, Odds ratio; SNP, Single nucleotide polymorphism; TEWL, Transepidermal water loss; WT, Wild-type
12.  Transcription Termination and Chimeric RNA Formation Controlled by Arabidopsis thaliana FPA 
PLoS Genetics  2013;9(10):e1003867.
Alternative cleavage and polyadenylation influence the coding and regulatory potential of mRNAs and where transcription termination occurs. Although widespread, few regulators of this process are known. The Arabidopsis thaliana protein FPA is a rare example of a trans-acting regulator of poly(A) site choice. Analysing fpa mutants therefore provides an opportunity to reveal generic consequences of disrupting this process. We used direct RNA sequencing to quantify shifts in RNA 3′ formation in fpa mutants. Here we show that specific chimeric RNAs formed between the exons of otherwise separate genes are a striking consequence of loss of FPA function. We define intergenic read-through transcripts resulting from defective RNA 3′ end formation in fpa mutants and detail cryptic splicing and antisense transcription associated with these read-through RNAs. We identify alternative polyadenylation within introns that is sensitive to FPA and show FPA-dependent shifts in IBM1 poly(A) site selection that differ from those recently defined in mutants defective in intragenic heterochromatin and DNA methylation. Finally, we show that defective termination at specific loci in fpa mutants is shared with dicer-like 1 (dcl1) or dcl4 mutants, leading us to develop alternative explanations for some silencing roles of these proteins. We relate our findings to the impact that altered patterns of 3′ end formation can have on gene and genome organisation.
Author Summary
The ends of almost all eukaryotic protein-coding genes are defined by a poly(A) signal. When genes are transcribed into mRNA by RNA polymerase II, the poly(A) signal guides cleavage of the precursor mRNA at a particular site; this is accompanied by the addition of a poly(A) tail to the mRNA and termination of transcription. Many genes have more than one poly(A) signal and the regulated choice of which to select can effectively determine what the gene will code for, how the gene can be regulated and where transcription termination occurs. We discovered a rare example of a regulator of poly(A) site choice, called FPA, while studying flower development in the model plant Arabidopsis thaliana. Studying FPA therefore provides an opportunity to understand not only its roles in plant biology but also the generic consequences of disrupting alternative polyadenylation. In this study, we use a technique called direct RNA sequencing to quantify genome-wide shifts in poly(A) site selection in plants that lack FPA function. One of our most striking findings is that in the absence of FPA we detect chimeric RNAs formed between two otherwise separate and well-characterised genes.
PMCID: PMC3814327  PMID: 24204292
13.  The RNA-binding protein FPA regulates flg22-triggered defense responses and transcription factor activity by alternative polyadenylation 
Scientific Reports  2013;3:2866.
RNA-binding proteins (RBPs) play an important role in plant host-microbe interactions. In this study, we show that the plant RBP known as FPA, which regulates 3′-end mRNA polyadenylation, negatively regulates basal resistance to bacterial pathogen Pseudomonas syringae in Arabidopsis. A custom microarray analysis reveals that flg22, a peptide derived from bacterial flagellins, induces expression of alternatively polyadenylated isoforms of mRNA encoding the defence-related transcriptional repressor ETHYLENE RESPONSE FACTOR 4 (ERF4), which is regulated by FPA. Flg22 induces expression of a novel isoform of ERF4 that lacks the ERF-associated amphiphilic repression (EAR) motif, while FPA inhibits this induction. The EAR-lacking isoform of ERF4 acts as a transcriptional activator in vivo and suppresses the flg22-dependent reactive oxygen species burst. We propose that FPA controls use of proximal polyadenylation sites of ERF4, which quantitatively limit the defence response output.
PMCID: PMC3793224  PMID: 24104185
14.  A new family of transcription factors 
Development (Cambridge, England)  2008;135(18):3093-3101.
CudA, a nuclear protein required for Dictyostelium prespore-specific gene expression, binds in vivo to the promoter of the cotC prespore gene. A 14 nucleotide region of the cotC promoter binds CudA in vitro and ECudA, an Entamoeba CudA homologue, also binds to this site. The CudA and ECudA DNA-binding sites contain a dyad and, consistent with a symmetrical binding site, CudA forms a homodimer in the yeast two-hybrid system. Mutation of CudA binding sites within the cotC promoter reduces expression from cotC in prespore cells. The CudA and ECudA proteins share a 120 amino acid core of homology, and clustered point mutations introduced into two highly conserved motifs within the ECudA core region decrease its specific DNA binding in vitro. This region, the presumptive DNA-binding domain, is similar in sequence to domains in two Arabidopsis proteins and one Oryza protein. Significantly, these are the only proteins in the two plant species that contain an SH2 domain. Such a structure, with a DNA-binding domain located upstream of an SH2 domain, suggests that the plant proteins are orthologous to metazoan STATs. Consistent with this notion, the DNA sequence of the CudA half site, GAA, is identical to metazoan STAT half sites, although the relative positions of the two halves of the dyad are reversed. These results define a hitherto unrecognised class of transcription factors and suggest a model for the evolution of STATs and their DNA-binding sites.
PMCID: PMC3586674  PMID: 18701541
Dictyostelium; CudA; Amoebozoa; Plant STATs; SH2 domains
15.  Human box C/D snoRNA processing conservation across multiple cell types 
Nucleic Acids Research  2011;40(8):3676-3688.
Small nucleolar RNAs (snoRNAs) function mainly as guides for the post-transcriptional modification of ribosomal RNAs (rRNAs). In recent years, several studies have identified a wealth of small fragments (<35 nt) derived from snoRNAs (termed sdRNAs) that stably accumulate in the cell, some of which may regulate splicing or translation. A comparison of human small RNA deep sequencing data sets reveals that box C/D sdRNA accumulation patterns are conserved across multiple cell types although the ratio of the abundance of different sdRNAs from a given snoRNA varies. sdRNA profiles of many snoRNAs are specific and resemble the cleavage profiles of miRNAs. Many do not show characteristics of general RNA degradation, as seen for the accumulation of small fragments derived from snRNA or rRNA. While 53% of the sdRNAs contain an snoRNA box C motif and boxes D and D′ are also common in sdRNAs (54%), relatively few (12%) contain a full snoRNA guide region. One box C/D snoRNA, HBII-180C, was analysed in greater detail, revealing the presence of C′ box-containing sdRNAs complementary to several pre-messenger RNAs (pre-mRNAs) including FGFR3. Functional analyses demonstrated that this region of HBII-180C can influence the alternative splicing of FGFR3 pre-mRNA, supporting a role for some snoRNAs in the regulation of splicing.
PMCID: PMC3333852  PMID: 22199253
16.  Computational approaches to selecting and optimising targets for structural biology 
Methods (San Diego, Calif.)  2011;55(1):3-11.
► Identifies key considerations in target selection and optimisation. ► Approaches to assign useful protein features and structure/function relationships. ► Comparison of latest crystallisation propensity predictors on nonredundant data. ► Discusses single point of reference target selection/optimisation resources. ► Guidance on using the SSPF Target Optimisation Utility (TarO).
Selection of protein targets for study is central to structural biology and may be influenced by numerous factors. A key aim is to maximise returns for effort invested by identifying proteins with the balance of biophysical properties that are conducive to success at all stages (e.g. solubility, crystallisation) in the route towards a high resolution structural model. Selected targets can be optimised through construct design (e.g. to minimise protein disorder), switching to a homologous protein, and selection of experimental methodology (e.g. choice of expression system) to prime for efficient progress through the structural proteomics pipeline.
Here we discuss computational techniques in target selection and optimisation, with more detailed focus on tools developed within the Scottish Structural Proteomics Facility (SSPF); namely XANNpred, ParCrys, OB-Score (target selection) and TarO (target optimisation). TarO runs a large number of algorithms, searching for homologues and annotating the pool of possible alternative targets. This pool of putative homologues is presented in a ranked, tabulated format and results are also visualised as an automatically generated and annotated multiple sequence alignment. The target selection algorithms each predict the propensity of a selected protein target to progress through the experimental stages leading to diffracting crystals. This single predictor approach has advantages for target selection, when compared with an approach using two or more predictors that each predict for success at a single experimental stage. The tools described here helped SSPF achieve a high (21%) success rate in progressing cloned targets to diffraction-quality crystals.
PMCID: PMC3202631  PMID: 21906678
MSA, Multiple Sequence Alignment; PTM, Post Translational Modification; SSPF, Scottish Structural Proteomics Facility; MCC, Matthews correlation coefficient; AROC, Area Under the Receiver Operator Characteristic curve; Target selection; Crystallisation; Structural genomics; Structural biology; Bioinformatics; Construct design
17.  NoD: a Nucleolar localization sequence detector for eukaryotic and viral proteins 
BMC Bioinformatics  2011;12:317.
Nucleolar localization sequences (NoLSs) are short targeting sequences responsible for the localization of proteins to the nucleolus. Given the large number of proteins experimentally detected in the nucleolus and the central role of this subnuclear compartment in the cell, NoLSs are likely to be important regulatory elements controlling cellular traffic. Although many proteins have been reported to contain NoLSs, the systematic characterization of this group of targeting motifs has only recently been carried out.
Here, we describe NoD, a web server and a command line program that predicts the presence of NoLSs in proteins. Using the web server, users can submit protein sequences through the NoD input form and are provided with a graphical output of the NoLS score as a function of protein position. While the web server is most convenient for making prediction for just a few proteins, the command line version of NoD can return predictions for complete proteomes. NoD is based on our recently described human-trained artificial neural network predictor. Through stringent independent testing of the predictor using available experimentally validated NoLS-containing eukaryotic and viral proteins, the NoD sensitivity and positive predictive value were estimated to be 71% and 79% respectively.
NoD is the first tool to provide predictions of nucleolar localization sequences in diverse eukaryotes and viruses. NoD can be run interactively online or downloaded to use locally.
PMCID: PMC3166288  PMID: 21812952
nucleolus; protein targeting signal; protein localization; NoD web server
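The two performance figures reported for NoD, sensitivity and positive predictive value, follow directly from confusion-matrix counts. The Python sketch below shows the definitions only; the function names and the counts are hypothetical, chosen so that the results match the quoted percentages, and are not the study's actual test-set numbers.

```python
def sensitivity(tp, fn):
    """Fraction of true NoLS-containing proteins that are predicted as such."""
    return tp / (tp + fn)

def positive_predictive_value(tp, fp):
    """Fraction of positive predictions that are true NoLS-containing proteins."""
    return tp / (tp + fp)

# Hypothetical counts for illustration only.
example_sensitivity = sensitivity(tp=71, fn=29)
example_ppv = positive_predictive_value(tp=79, fp=21)
```

Sensitivity penalises missed NoLSs, whereas positive predictive value penalises false alarms, so the two figures are reported together.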
18.  Java bioinformatics analysis web services for multiple sequence alignment—JABAWS:MSA 
Bioinformatics  2011;27(14):2001-2002.
Summary: JABAWS is a web services framework that simplifies the deployment of web services for bioinformatics. JABAWS:MSA provides services for five multiple sequence alignment (MSA) methods (Probcons, T-coffee, Muscle, Mafft and ClustalW), and is the system employed by the Jalview multiple sequence analysis workbench since version 2.6. A fully functional, easy to set up server is provided as a Virtual Appliance (VA), which can be run on most operating systems that support a virtualization environment such as VMware or Oracle VirtualBox. JABAWS is also distributed as a Web Application aRchive (WAR) and can be configured to run on a single computer and/or a cluster managed by Grid Engine, LSF or other queuing systems that support DRMAA. JABAWS:MSA provides clients full access to each application's parameters, allows administrators to specify named parameter preset combinations and execution limits for each application through simple configuration files. The JABAWS command-line client allows integration of JABAWS services into conventional scripts.
Availability and Implementation: JABAWS is made freely available under the Apache 2 license and can be obtained from:
PMCID: PMC3129525  PMID: 21593132
19.  Global network analysis of drug tolerance, mode of action and virulence in methicillin-resistant S. aureus 
BMC Systems Biology  2011;5:68.
Staphylococcus aureus is a major human pathogen and strains resistant to existing treatments continue to emerge. Development of novel treatments is therefore important. Antimicrobial peptides represent a source of potential novel antibiotics to combat resistant bacteria such as Methicillin-Resistant Staphylococcus aureus (MRSA). A promising antimicrobial peptide is ranalexin, which has potent activity against Gram-positive bacteria, and particularly S. aureus. Understanding mode of action is a key component of drug discovery and network biology approaches enable a global, integrated view of microbial physiology, including mechanisms of antibiotic killing. We developed a systems-wide functional association network approach to integrate proteome and transcriptome profiles, enabling study of drug resistance and mode of action.
The functional association network was constructed by Bayesian logistic regression, providing a framework for identification of antimicrobial peptide (ranalexin) response modules from S. aureus MRSA-252 transcriptome and proteome profiling. These signatures of ranalexin treatment revealed multiple killing mechanisms, including cell wall activity. Cell wall effects were supported by gene disruption and osmotic fragility experiments. Furthermore, twenty-two novel virulence factors were inferred, while the VraRS two-component system and PhoU-mediated persister formation were implicated in MRSA tolerance to cationic antimicrobial peptides.
This work demonstrates a powerful integrative approach to study drug resistance and mode of action. Our findings are informative to the development of novel therapeutic strategies against Staphylococcus aureus and particularly MRSA.
PMCID: PMC3123200  PMID: 21569391
20.  The SWI/SNF complex acts to constrain distribution of the centromeric histone variant Cse4 
The EMBO Journal  2011;30(10):1919-1927.
The SWI/SNF complex has an important role in regulating chromatin structure during transcriptional activation and DNA repair. Here, the SWI/SNF complex is shown to also be involved in the organisation of centromeric chromatin and prevention of the ectopic deposition of centromeric histone variants.
In order to gain insight into the function of the Saccharomyces cerevisiae SWI/SNF complex, we have identified DNA sequences to which it is bound genome-wide. One surprising observation is that the complex is enriched at the centromeres of each chromosome. Deletion of the gene encoding the Snf2 subunit of the complex was found to cause partial redistribution of the centromeric histone variant Cse4 to sites on chromosome arms. Cultures of snf2Δ yeast were found to progress through mitosis slowly. This was dependent on the mitotic checkpoint protein Mad2. In the absence of Mad2, defects in chromosome segregation were observed. In the absence of Snf2, chromatin organisation at centromeres is less distinct. In particular, hypersensitive sites flanking the Cse4-containing nucleosomes are less pronounced. Furthermore, the SWI/SNF complex was found to be especially effective in the dissociation of Cse4-containing chromatin in vitro. This suggests a role for Snf2 in the maintenance of point centromeres involving the removal of Cse4 from ectopic sites.
PMCID: PMC3098484  PMID: 21505420
centromere; chromatin; Cse4; nucleosome; SWI/SNF
21.  PIMS sequencing extension: a laboratory information management system for DNA sequencing facilities 
BMC Research Notes  2011;4:48.
Facilities that provide a service for DNA sequencing typically support large numbers of users and experiment types. The cost of services is often reduced by the use of liquid handling robots but the efficiency of such facilities is hampered because the software for such robots does not usually integrate well with the systems that run the sequencing machines. Accordingly, there is a need for software systems capable of integrating different robotic systems and managing sample information for DNA sequencing services. In this paper, we describe an extension to the Protein Information Management System (PIMS) that is designed for DNA sequencing facilities. The new version of PIMS has a user-friendly web interface and integrates all aspects of the sequencing process, including sample submission, handling and tracking, together with capture and management of the data.
The PIMS sequencing extension has been in production since July 2009 at the University of Leeds DNA Sequencing Facility. It has completely replaced manual data handling and simplified the tasks of data management and user communication. Samples from 45 groups have been processed with an average throughput of 10000 samples per month. The current version of the PIMS sequencing extension works with the Applied Biosystems 3130XL 96-well plate sequencer and MWG 4204 or Aviso Theonyx liquid handling robots, but is readily adaptable for use with other combinations of robots.
PIMS has been extended to provide a user-friendly and integrated data management solution for DNA sequencing facilities that is accessed through a normal web browser and allows simultaneous access by multiple users as well as facility managers. The system integrates sequencing and liquid handling robots, manages the data flow, and provides remote access to the sequencing results. The software is freely available, for academic users, from
PMCID: PMC3058032  PMID: 21385349
22.  PNAC: a protein nucleolar association classifier 
BMC Genomics  2011;12:74.
Although primarily known as the site of ribosome subunit production, the nucleolus is involved in numerous and diverse cellular processes. Recent large-scale proteomics projects have identified thousands of human proteins that associate with the nucleolus. However, in most cases, we know neither the fraction of each protein pool that is nucleolus-associated nor whether their association is permanent or conditional.
To describe the dynamic localisation of proteins in the nucleolus, we investigated the extent of nucleolar association of proteins by first collating an extensively curated literature-derived dataset. This dataset then served to train a probabilistic predictor which integrates gene and protein characteristics. Unlike most previous experimental and computational studies of the nucleolar proteome that produce large static lists of nucleolar proteins regardless of their extent of nucleolar association, our predictor models the fluidity of the nucleolus by considering different classes of nucleolar-associated proteins. The new method classifies each human protein as either nucleolar-enriched, nucleolar-nucleoplasmic, nucleolar-cytoplasmic or non-nucleolar. Leave-one-out cross-validation tests reveal sensitivity values for these four classes ranging from 0.72 to 0.90 and positive predictive values ranging from 0.63 to 0.94. The overall accuracy of the classifier was measured to be 0.85 on an independent literature-based test set and 0.74 using a large independent quantitative proteomics dataset. While the three nucleolar-association groups display vastly different Gene Ontology biological process signatures and evolutionary characteristics, they collectively represent the best-characterised nucleolar functions.
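The per-class sensitivity, positive predictive value and overall accuracy quoted above are standard multi-class evaluation measures. The following is a minimal sketch of how such figures can be computed from paired true and predicted labels; it is not the authors' evaluation code, and the function names and any labels passed to them are hypothetical illustrations.

```python
from collections import Counter

# The four classes named in the abstract; used only as default labels here.
CLASSES = [
    "nucleolar-enriched",
    "nucleolar-nucleoplasmic",
    "nucleolar-cytoplasmic",
    "non-nucleolar",
]


def per_class_metrics(true_labels, predicted_labels, classes=CLASSES):
    """Return {class: (sensitivity, positive predictive value)}."""
    pairs = Counter(zip(true_labels, predicted_labels))
    metrics = {}
    for c in classes:
        tp = pairs[(c, c)]  # correctly predicted as c
        fn = sum(n for (t, p), n in pairs.items() if t == c and p != c)
        fp = sum(n for (t, p), n in pairs.items() if t != c and p == c)
        sensitivity = tp / (tp + fn) if tp + fn else float("nan")
        ppv = tp / (tp + fp) if tp + fp else float("nan")
        metrics[c] = (sensitivity, ppv)
    return metrics


def accuracy(true_labels, predicted_labels):
    """Fraction of proteins whose predicted class matches the true class."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)
```

In a leave-one-out setting, each protein's predicted label would come from a model trained on all other proteins, and the resulting label pairs would then be passed to these functions.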
Our proteome-wide classification of nucleolar association provides a novel representation of the dynamic content of the nucleolus. This model of nucleolar localisation thus increases the coverage while providing accurate and specific annotations of the nucleolar proteome. It will be instrumental in better understanding the central role of the nucleolus in the cell and its interaction with other subcellular compartments.
PMCID: PMC3038921  PMID: 21272300
23.  Identification of human miRNA precursors that resemble box C/D snoRNAs 
Nucleic Acids Research  2011;39(9):3879-3891.
There are two main classes of small nucleolar RNAs (snoRNAs): the box C/D snoRNAs and the box H/ACA snoRNAs that function as guide RNAs to direct sequence-specific modification of rRNA precursors and other nucleolar RNA targets. A previous computational and biochemical analysis revealed a possible evolutionary relationship between miRNA precursors and some box H/ACA snoRNAs. Here, we investigate a similar evolutionary relationship between a subset of miRNA precursors and box C/D snoRNAs. Computational analyses identified 84 intronic miRNAs that are encoded within either box C/D snoRNAs, or in precursors showing similarity to box C/D snoRNAs. The predicted folded structures of these box C/D snoRNA-like miRNA precursors resemble the structures of known box C/D snoRNAs, with the boxes C and D often in close proximity in the folded molecule. All five box C/D snoRNA-like miRNA precursors tested (miR-27b, miR-16-1, miR-28, miR-31 and let-7g) bind to fibrillarin, a specific protein component of functional box C/D snoRNP complexes. The data suggest that a subset of small regulatory RNAs may have evolved from box C/D snoRNAs.
PMCID: PMC3089480  PMID: 21247878
24.  Characterization and prediction of protein nucleolar localization sequences 
Nucleic Acids Research  2010;38(21):7388-7399.
Although the nucleolar localization of proteins is often believed to be mediated primarily by non-specific retention to core nucleolar components, many examples of short nucleolar targeting sequences have been reported in recent years. In this article, 46 human nucleolar localization sequences (NoLSs) were collated from the literature and subjected to statistical analysis. Of the residues in these NoLSs, 48% are basic, whereas 99% of the residues are predicted to be solvent-accessible, with 42% in α-helix and 57% in coil. The sequence and predicted protein secondary structure of the 46 NoLSs were used to train an artificial neural network to identify NoLSs. At a true positive rate of 54%, the predictor’s overall false positive rate (FPR) is estimated to be 1.52%, which can be broken down to FPRs of 0.26% for randomly chosen cytoplasmic sequences, 0.80% for randomly chosen nucleoplasmic sequences and 12% for nuclear localization signals. The predictor was used to predict NoLSs in the complete human proteome and 10 of the highest scoring previously unknown NoLSs were experimentally confirmed. NoLSs are a prevalent type of targeting motif that is distinct from nuclear localization signals and that can be computationally predicted.
PMCID: PMC2995072  PMID: 20663773
25.  Analysis of Human Small Nucleolar RNAs (snoRNA) and the Development of snoRNA Modulator of Gene Expression Vectors 
Molecular Biology of the Cell  2010;21(9):1569-1584.
In this manuscript we describe the characterisation of human snoRNAs that co-purify with nucleoli and develop a new vector-based system for targeted gene knockdown. We demonstrate that this novel vector system (snoMEN) can deliver effective, sequence-specific knockdown of endogenous cellular genes as well as GFP and GFP-fusion proteins.
Human small nucleolar RNAs (snoRNAs) that copurify with nucleoli isolated from HeLa cells have been characterized. Novel fibrillarin-associated snoRNAs were detected that allowed the creation of a new vector system for the targeted knockdown of one or more genes in mammalian cells. The snoMEN (snoRNA modulator of gene expressioN) vector technology is based on snoRNA HBII-180C, which contains an internal sequence that can be manipulated to make it complementary to RNA targets. Gene-specific knockdowns are demonstrated for endogenous cellular proteins and for G/YFP-fusion proteins. Multiplex snoMEN vectors coexpress multiple snoRNAs in one transcript, targeted either to different genes or to different sites in the same gene. Protein replacement snoMEN vectors can express a single transcript combining cDNA for a tagged protein with introns containing cognate snoRNAs targeted to knockdown the endogenous cellular protein. We foresee applications for snoMEN vectors in basic gene expression research, target validation, and gene therapy.
PMCID: PMC2861615  PMID: 20219969

