1.  Systematic exploration of guide-tree topology effects for small protein alignments 
BMC Bioinformatics  2014;15(1):338.
Guide-trees are used as part of an essential heuristic to enable the calculation of multiple sequence alignments. They have been the focus of much method development but there has been little effort at determining systematically which guide-trees, if any, give the best alignments. Some guide-tree construction schemes are based on pair-wise distances amongst unaligned sequences. Others try to emulate an underlying evolutionary tree and involve various iteration methods.
We explore all possible guide-trees for a set of protein alignments of up to eight sequences. We find that pairwise distance based default guide-trees sometimes outperform evolutionary guide-trees, as measured by structure derived reference alignments. However, default guide-trees fall well short of the optimum attainable scores. On average, chained guide-trees perform better than balanced ones but are not better than default guide-trees for small alignments.
Alignment methods that use consistency or hidden Markov models to make alignments are less susceptible to sub-optimal guide-trees than simpler methods that essentially use conventional sequence alignment between profiles. The latter appear to be affected positively by evolutionary based guide-trees for difficult alignments and negatively for easy alignments. One phylogeny-aware alignment program can strongly discriminate between good and bad guide-trees. The results for randomly chained guide-trees improve with the number of sequences.
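To give a sense of what exploring all possible guide-trees for up to eight sequences involves, the sketch below counts tree topologies and builds the two extreme shapes mentioned in the abstract. It assumes guide-trees are rooted, binary and leaf-labelled, in which case the number of topologies for n sequences is (2n-3)!!; the function names and the Newick-style output are illustrative choices, not taken from the article.

```python
from math import prod


def count_rooted_binary_trees(n: int) -> int:
    """Number of rooted, binary, leaf-labelled topologies: (2n-3)!!."""
    if n < 1:
        raise ValueError("need at least one sequence")
    # Product of the odd numbers 1, 3, ..., 2n-3 (empty product is 1 for n = 1).
    return prod(range(1, 2 * n - 2, 2))


def chained_newick(labels: list[str]) -> str:
    """Fully unbalanced (chained) tree: sequences are added one at a time."""
    tree = labels[0]
    for label in labels[1:]:
        tree = f"({tree},{label})"
    return tree


def balanced_newick(labels: list[str]) -> str:
    """Balanced tree: the label list is split in half recursively."""
    if len(labels) == 1:
        return labels[0]
    mid = len(labels) // 2
    return f"({balanced_newick(labels[:mid])},{balanced_newick(labels[mid:])})"


if __name__ == "__main__":
    for n in range(2, 9):
        print(n, count_rooted_binary_trees(n))
    print(chained_newick(["A", "B", "C", "D"]))
    print(balanced_newick(["A", "B", "C", "D"]))
```

Under these assumptions the count grows from 3 topologies for three sequences to 135,135 for eight, which suggests why exhaustive exploration is only practical for small alignments.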
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2105-15-338) contains supplementary material, which is available to authorized users.
PMCID: PMC4287568  PMID: 25282640
Multiple sequence alignment; Guide-tree topology; Alignment accuracy; Benchmarking
2.  Genes and signaling networks regulated during zebrafish optic vesicle morphogenesis 
BMC Genomics  2014;15(1):825.
The genetic cascades underpinning vertebrate early eye morphogenesis are poorly understood. One gene family essential for eye morphogenesis encodes the retinal homeobox (Rx) transcription factors. Mutations in the human retinal homeobox gene (RAX) can lead to gross morphological phenotypes ranging from microphthalmia to anophthalmia. Zebrafish rx3 null mutants produce a similar striking eyeless phenotype with an associated expanded forebrain. Thus, we used zebrafish rx3-/- mutants as a model to uncover an Rx3-regulated gene network during early eye morphogenesis.
Rx3-regulated genes were identified using whole transcriptomic sequencing (RNA-seq) of rx3-/- mutants and morphologically wild-type siblings during optic vesicle morphogenesis. A gene co-expression network was then constructed for the Rx3-regulated genes, identifying gene cross-talk during early eye development. Genes highly connected in the network are hub genes, which tend to exhibit higher expression changes between rx3-/- mutants and normal phenotype siblings. Hub genes down-regulated in rx3-/- mutants encompass homeodomain transcription factors and mediators of retinoid-signaling, both associated with eye development and known human eye disorders. In contrast, genes up-regulated in rx3-/- mutants are centered on Wnt signaling pathways, associated with brain development and disorders. The temporal expression pattern of Rx3-regulated genes was further profiled during early development from maternal stage until visual function is fully mature. Rx3-regulated genes exhibited synchronized expression patterns, and a transition of gene expression during the early segmentation stage when Rx3 was highly expressed. Furthermore, most of these deregulated genes are enriched with multiple RAX-binding motif sequences on the gene promoter.
Here, we assembled a comprehensive model of Rx3-regulated genes during early eye morphogenesis. Rx3 promotes optic vesicle morphogenesis and represses brain development through a highly correlated and modulated network, exhibiting repression of genes mediating Wnt signaling and concomitant enhanced expression of homeodomain transcription factors and retinoid-signaling genes.
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2164-15-825) contains supplementary material, which is available to authorized users.
PMCID: PMC4190348  PMID: 25266257
3.  Comparative Phenotypic Analysis of the Major Fungal Pathogens Candida parapsilosis and Candida albicans 
PLoS Pathogens  2014;10(9):e1004365.
Candida parapsilosis and Candida albicans are human fungal pathogens that belong to the CTG clade in the Saccharomycotina. In contrast to C. albicans, relatively little is known about the virulence properties of C. parapsilosis, a pathogen particularly associated with infections of premature neonates. We describe here the construction of C. parapsilosis strains carrying double allele deletions of 100 transcription factors, protein kinases and species-specific genes. Two independent deletions were constructed for each target gene. Growth in >40 conditions was tested, including carbon source, temperature, and the presence of antifungal drugs. The phenotypes were compared to C. albicans strains with deletions of orthologous transcription factors. We found that many phenotypes are shared between the two species, such as the role of Upc2 as a regulator of azole resistance, and of CAP1 in the oxidative stress response. Others are unique to one species. For example, Cph2 plays a role in the hypoxic response in C. parapsilosis but not in C. albicans. We found extensive divergence between the biofilm regulators of the two species. We identified seven transcription factors and one protein kinase that are required for biofilm development in C. parapsilosis. Only three (Efg1, Bcr1 and Ace2) have similar effects on C. albicans biofilms, whereas Cph2, Czf1, Gzf3 and Ume6 have major roles in C. parapsilosis only. Two transcription factors (Brg1 and Tec1) with well-characterized roles in biofilm formation in C. albicans do not have the same function in C. parapsilosis. We also compared the transcription profile of C. parapsilosis and C. albicans biofilms. Our analysis suggests the processes shared between the two species are predominantly metabolic, and that Cph2 and Bcr1 are major biofilm regulators in C. parapsilosis.
Author Summary
Candida species are among the most common causes of fungal infection worldwide. Infections can be both community-based and hospital-acquired, and are particularly associated with immunocompromised individuals. Candida albicans is the most commonly isolated species and is the best studied. However, other species are becoming of increasing concern. Candida parapsilosis causes outbreaks of infection in neonatal wards, and is one of the few Candida species that is transferred from the hands of healthcare workers. C. parapsilosis, like C. albicans, grows as biofilms (cell communities) on the surfaces of indwelling medical devices like feeding tubes. We describe here the construction of a set of tools that allow us to characterize the virulence properties of C. parapsilosis, and in particular its ability to grow as biofilms. We find that some of the regulatory mechanisms are shared with C. albicans, but others are unique to each species. Our tools, based on selectively deleting regulatory genes, will provide a major resource to the fungal research community.
PMCID: PMC4169492  PMID: 25233198
4.  Loss of Olfactory Receptor Function in Hominin Evolution 
PLoS ONE  2014;9(1):e84714.
The mammalian sense of smell is governed by the largest gene family, which encodes the olfactory receptors (ORs). The gain and loss of OR genes is typically correlated with adaptations to various ecological niches. Modern humans have 853 OR genes but 55% of these have lost their function. Here we show evidence of additional OR loss of function in the Neanderthal and Denisovan hominin genomes using comparative genomic methodologies. Ten Neanderthal and 8 Denisovan ORs show evidence of loss of function that differ from the reference modern human OR genome. Some of these losses are also present in a subset of modern humans, while some are unique to each lineage. Morphological changes in the cranium of Neanderthals suggest different sensory arrangements to that of modern humans. We identify differences in functional olfactory receptor genes among modern humans, Neanderthals and Denisovans, suggesting varied loss of function across all three taxa and we highlight the utility of using genomic information to elucidate the sensory niches of extinct species.
PMCID: PMC3879314  PMID: 24392153
5.  GWIPS-viz: development of a ribo-seq genome browser 
Nucleic Acids Research  2013;42(Database issue):D859-D864.
We describe the development of GWIPS-viz, an online genome browser for viewing ribosome profiling data. Ribosome profiling (ribo-seq) is a recently developed technique that provides genome-wide information on protein synthesis (GWIPS) in vivo. It is based on the deep sequencing of ribosome-protected messenger RNA (mRNA) fragments, which allows the ribosome density along all mRNA transcripts present in the cell to be quantified. Since its inception, ribo-seq has been carried out in a number of eukaryotic and prokaryotic organisms. Owing to the increasing interest in ribo-seq, there is a pertinent demand for a dedicated ribo-seq genome browser. GWIPS-viz is based on The University of California Santa Cruz (UCSC) Genome Browser. Ribo-seq tracks, coupled with mRNA-seq tracks, are currently available for several genomes: human, mouse, zebrafish, nematode, yeast, bacteria (Escherichia coli K12, Bacillus subtilis), human cytomegalovirus and bacteriophage lambda. Our objective is to continue incorporating published ribo-seq data sets so that the wider community can readily view ribosome profiling information from multiple studies without the need to carry out computational processing.
PMCID: PMC3965066  PMID: 24185699
6.  The Cu regulon of the human fungal pathogen Cryptococcus neoformans H99: Cuf1 activates distinct genes in response to both Cu excess and deficiency 
Molecular microbiology  2011;81(6):1560-1576.
Cryptococcus neoformans is a human fungal pathogen that is the causative agent of cryptococcosis and fatal meningitis in immuno-compromised hosts. Recent studies suggest that copper (Cu) acquisition plays an important role in C. neoformans virulence, as mutants that lack Cuf1, which activates the Ctr4 high affinity Cu importer, are hypo-virulent in mouse models. To understand the constellation of Cu-responsive genes in C. neoformans and how their expression might contribute to virulence, we determined the transcript profile of C. neoformans in response to elevated Cu or Cu deficiency. We identified two metallothionein genes (CMT1 and CMT2), encoding cysteine-rich Cu binding and detoxifying proteins, whose expression is dramatically elevated in response to excess Cu. We identified a new C. neoformans Cu transporter, CnCtr1, that is induced by Cu deficiency, is distinct from CnCtr4, and shows a significant phylogenetic relationship to Ctr1 from other fungi. Surprisingly, in contrast to other fungi, we found that induction of CnCTR1 and CnCTR4 expression under Cu limitation, and CMT1 and CMT2 in response to Cu excess, are dependent on the CnCuf1 Cu metalloregulatory transcription factor. These studies set the stage for the evaluation of the specific Cuf1 target genes required for virulence in C. neoformans.
PMCID: PMC3718005  PMID: 21819456
7.  Inhibition of the Pim1 Oncogene Results in Diminished Visual Function 
PLoS ONE  2012;7(12):e52177.
Our objective was to profile genetic pathways whose differential expression correlates with maturation of visual function in zebrafish. Bioinformatic analysis of transcriptomic data revealed Jak-Stat signalling as the pathway most enriched in the eye, as visual function develops. Real-time PCR, western blotting, immunohistochemistry and in situ hybridization data confirm that multiple Jak-Stat pathway genes are up-regulated in the zebrafish eye between 3–5 days post-fertilisation, times associated with significant maturation of vision. One of the most up-regulated Jak-Stat genes is the proto-oncogene Pim1 kinase, previously associated with haematological malignancies and cancer. Loss of function experiments using Pim1 morpholinos or Pim1 inhibitors result in significant diminishment of visual behaviour and function. In summary, we have identified that enhanced expression of Jak-Stat pathway genes correlates with maturation of visual function and that the Pim1 oncogene is required for normal visual function.
PMCID: PMC3530609  PMID: 23300608
8.  Subgenotyping of Genotype C Hepatitis B Virus: Correcting Misclassifications and Identifying a Novel Subgenotype 
PLoS ONE  2012;7(10):e47271.
More than ten subgenotypes of genotype C Hepatitis B virus (HBV) have been reported, including C1 to C16 and two C/D recombinant subgenotypes (CD1 and CD2); however, inconsistent designations of these subgenotypes still exist.
Methodology/Principal Findings
We performed a phylogenetic analysis of all full-length genotype C HBV genome sequences to correct the misclassifications of HBV subgenotypes and to study the influence of recombination on HBV subgenotyping. Our results showed that although inclusion of the recombinant sequences changed the topology of the phylogenetic tree, it did not affect the subgenotyping of the non-recombinant sequences, except subgenotype C2. In addition, most of the subgenotypes have been properly designated. However, several misclassifications of HBV subgenotypes have been identified and corrected. For example, C11 proposed by Utsumi and colleagues in 2011 was found to be grouped with C12 proposed by Mulyanto and colleagues. Two sequences, GQ358157 and GU721029, previously designated as C6 have been re-designated as C12 and C7, respectively. Moreover, a quasi-subgenotype C2 was proposed, which included the old C2, several previously unclassified sequences and previously designated C14. In particular, we identified a novel subgenotype, tentative C14, which was well supported by phylogenetic analysis and sequence divergence of >4%.
A number of misclassifications in the subgenotyping of genotype C HBV have been identified in this study. After correcting the misclassifications, we proposed a better classification for the subgenotyping of genotype C HBV, in which a novel quasi-subgenotype C2 and a novel subgenotype, tentative C14, were described. Based on this large-scale analysis, we propose that a novel subgenotype should only be reported after a complete comparison of all relevant sequences rather than a few representative sequences only.
PMCID: PMC3471840  PMID: 23077582
9.  Subgenotype reclassification of genotype B hepatitis B virus 
BMC Gastroenterology  2012;12:116.
Nine subgenotypes from genotype B have been identified for hepatitis B virus (HBV). However, these subgenotypes were less conclusive as they were often designated based on a few representative strains. In addition, subgenotype B6 was designated twice for viruses of different origin.
All complete genome sequences of genotype B HBV were phylogenetically analyzed. Sequence divergences between different potential subgenotypes were also assessed.
Both phylogenetic and sequence divergence analyses supported the designation of subgenotypes B1, B2, B4, and B6 (from Arctic). However, sequence divergences between previously designated B3, B5, B7, B8, B9 and another B6 (from China) were mostly less than 4%. In addition, subgenotype B3 did not form a monophyly.
Current evidence failed to classify original B5, B7, B8, B9, and B6 (from China) as subgenotypes. Instead, they could be considered as a quasi-subgenotype B3 of Southeast Asian and Chinese origin. In addition, previously designated B6 (from Arctic) should be renamed as B5 for continuous numbering. This novel classification is well supported by both the phylogeny and sequence divergence of > 4%.
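The 4% sequence divergence criterion used in these subgenotyping studies can be illustrated with a small calculation. The sketch below is only an assumption-laden example: it uses the simple proportion of differing sites (p-distance) between two already aligned sequences and skips gap columns, whereas the abstract does not state which distance measure or substitution model was used. The function names and the threshold constant are hypothetical.

```python
SUBGENOTYPE_THRESHOLD = 0.04  # the >4% divergence criterion quoted in the abstract


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap ('-') are ignored.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differing / compared


def exceeds_threshold(seq_a: str, seq_b: str,
                      threshold: float = SUBGENOTYPE_THRESHOLD) -> bool:
    """True when the divergence is strictly greater than the threshold."""
    return p_distance(seq_a, seq_b) > threshold


if __name__ == "__main__":
    reference = "A" * 100
    variant = "A" * 95 + "CCCCC"  # toy sequences, 5 differences in 100 sites
    print(p_distance(reference, variant), exceeds_threshold(reference, variant))
```

In practice such a comparison would presumably be made between groups of complete genomes rather than a single pair, so this shows only the arithmetic behind the criterion.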
PMCID: PMC3523008  PMID: 22925657
Hepatitis B virus; Subgenotype; Phylogenetic analysis; Sequence divergence
10.  Recombination in Hepatitis C Virus: Identification of Four Novel Naturally Occurring Inter-Subtype Recombinants 
PLoS ONE  2012;7(7):e41997.
Recombination in Hepatitis C virus (HCV) is considered to be rare. In this study, we performed a phylogenetic analysis of 1278 full-length HCV genome sequences to identify potential recombination events. Nine inter-genotype recombinants were identified, all of which have been previously reported. This confirms the rarity of inter-genotype HCV recombinants. The analysis also identified five inter-subtype recombinants, four of which are documented for the first time (EU246930, EU246931, EU246932, and EU246937). Specifically, the latter represent four different novel recombination types (6a/6o, 6e/6o, 6e/6h, and 6n/6o), and this was well supported by seven independent methods embedded in RDP. The breakpoints of the four novel HCV recombinants are located within the NS5B coding region and were different from all previously reported breakpoints. While the locations of the breakpoints identified by RDP were not identical, they are very close. Our study suggests that while recombination in HCV is rare, this warrants further investigation.
PMCID: PMC3404033  PMID: 22911872
11.  Using RNA-seq to determine the transcriptional landscape and the hypoxic response of the pathogenic yeast Candida parapsilosis 
BMC Genomics  2011;12:628.
Candida parapsilosis is one of the most common causes of Candida infection worldwide. However, the genome sequence annotation was made without experimental validation and little is known about the transcriptional landscape. The transcriptional response of C. parapsilosis to hypoxic (low oxygen) conditions, such as those encountered in the host, is also relatively unexplored.
We used next generation sequencing (RNA-seq) to determine the transcriptional profile of C. parapsilosis growing in several conditions including different media, temperatures and oxygen concentrations. We identified 395 novel protein-coding sequences that had not previously been annotated. We removed > 300 unsupported gene models, and corrected approximately 900. We mapped the 5' and 3' UTR for thousands of genes. We also identified 422 introns, including two introns in the 3' UTR of one gene. This is the first report of 3' UTR introns in the Saccharomycotina. Comparing the introns in coding sequences with other species shows that small numbers have been gained and lost throughout evolution. Our analysis also identified a number of novel transcriptional active regions (nTARs). We used both RNA-seq and microarray analysis to determine the transcriptional profile of cells grown in normoxic and hypoxic conditions in rich media, and we showed that there was a high correlation between the approaches. We also generated a knockout of the UPC2 transcriptional regulator, and we found that similar to C. albicans, Upc2 is required for conferring resistance to azole drugs, and for regulation of expression of the ergosterol pathway in hypoxia.
We provide the first detailed annotation of the C. parapsilosis genome, based on gene predictions and transcriptional analysis. We identified a number of novel ORFs and other transcribed regions, and detected transcripts from approximately 90% of the annotated protein coding genes. We found that the transcription factor Upc2 has a conserved role as a major regulator of the hypoxic response in C. parapsilosis and C. albicans.
PMCID: PMC3287387  PMID: 22192698
Transcriptional profiling, pathogenesis, RNA-seq, Candida
12.  Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega 
Multiple sequence alignments are fundamental to many sequence analysis methods. The new program Clustal Omega can align virtually any number of protein sequences quickly and has powerful features for adding sequences to existing precomputed alignments.
Multiple sequence alignments are fundamental to many sequence analysis methods. Most alignments are computed using the progressive alignment heuristic. These methods are starting to become a bottleneck in some analysis pipelines when faced with data sets of the size of many thousands of sequences. Some methods allow computation of larger data sets while sacrificing quality, and others produce high-quality alignments, but scale badly with the number of sequences. In this paper, we describe a new program called Clustal Omega, which can align virtually any number of protein sequences quickly and that delivers accurate alignments. The accuracy of the package on smaller test cases is similar to that of the high-quality aligners. On larger data sets, Clustal Omega outperforms other packages in terms of execution time and quality. Clustal Omega also has powerful features for adding sequences to and exploiting information in existing alignments, making use of the vast amount of precomputed information in public databases like Pfam.
PMCID: PMC3261699  PMID: 21988835
bioinformatics; hidden Markov models; multiple sequence alignment
13.  Regulation of the Hypoxic Response in Candida albicans 
Eukaryotic Cell  2010;9(11):1734-1746.
The regulation of the response of Candida albicans to hypoxic (low-oxygen) conditions is poorly understood. We used microarray and other transcriptional analyses to investigate the role of the Upc2 and Bcr1 transcription factors in controlling expression of genes involved in cell wall metabolism, ergosterol synthesis, and glycolysis during adaptation to hypoxia. Hypoxic induction of the ergosterol pathway is mimicked by treatment with sterol-lowering drugs (ketoconazole) and requires UPC2. Expression of three members of the family CFEM (common in several fungal extracellular membranes) of cell wall genes (RBT5, PGA7, and PGA10) is also induced by hypoxia and ketoconazole and requires both UPC2 and BCR1. Expression of glycolytic genes is induced by hypoxia but not by treatment with sterol-lowering drugs, whereas expression of respiratory pathway genes is repressed. However, Upc2 does not play a major role in regulating expression of genes required for central carbon metabolism. Our results indicate that regulation of gene expression in response to hypoxia in C. albicans is complex and is signaled both via lowered sterol levels and other unstudied mechanisms. We also show that induction of filamentation under hypoxic conditions requires the Ras1- and Cdc35-dependent pathway.
PMCID: PMC2976306  PMID: 20870877
14.  A Complete Analysis of HA and NA Genes of Influenza A Viruses 
PLoS ONE  2010;5(12):e14454.
More and more nucleotide sequences of type A influenza virus are available in public databases. Although these sequences have been the focus of many molecular epidemiological and phylogenetic analyses, most studies only deal with a few representative sequences. In this paper, we present a complete analysis of all Haemagglutinin (HA) and Neuraminidase (NA) gene sequences available to allow large scale analyses of the evolution and epidemiology of type A influenza.
Methodology/Principal Findings
This paper describes an analysis and complete classification of all HA and NA gene sequences available in public databases using multivariate and phylogenetic methods.
We analyzed 18975 HA sequences and divided them into 280 subgroups according to multivariate and phylogenetic analyses. Similarly, we divided 11362 NA sequences into 202 subgroups. Compared to previous analyses, this work is more detailed and comprehensive, especially for the bigger datasets. Therefore, it can be used to show the full and complex phylogenetic diversity and provides a framework for studying the molecular evolution and epidemiology of type A influenza virus. More than 85% of type A influenza HA and NA sequences in GenBank are categorized in one unambiguous and unique group. Therefore, our results are a kind of genetic and phylogenetic annotation for influenza HA and NA sequences. In addition, sequences of swine influenza viruses come from 56 HA and 45 NA subgroups. Most of these subgroups also include viruses from other hosts, indicating cross-species transmission of the viruses between pigs and other hosts. Furthermore, the phylogenetic diversity of swine influenza viruses from Eurasia is greater than that of North American strains and both of them are becoming more diverse. Apart from viruses from humans, pigs, birds and horses, viruses from other species show very low phylogenetic diversity. This might indicate that viruses have not become established in these species. Based on current evidence, there is no simple pattern of inter-hemisphere transmission of avian influenza viruses and it appears to happen sporadically. However, for H6 subtype avian influenza viruses, such transmissions might have happened very frequently and multiple and bidirectional transmission events might exist.
PMCID: PMC3012125  PMID: 21209922
15.  Ensemble approach combining multiple methods improves human transcription start site prediction 
BMC Genomics  2010;11:677.
The computational prediction of transcription start sites is an important unsolved problem. Some recent progress has been made, but many promoters, particularly those not associated with CpG islands, are still difficult to locate using current methods. These methods use different features and training sets, along with a variety of machine learning techniques and result in different prediction sets.
We demonstrate the heterogeneity of current prediction sets, and take advantage of this heterogeneity to construct a two-level classifier ('Profisi Ensemble') using predictions from 7 programs, along with 2 other data sources. Support vector machines using 'full' and 'reduced' data sets are combined in an either/or approach. We achieve a 14% increase in performance over the current state-of-the-art, as benchmarked by a third-party tool.
Supervised learning methods are a useful way to combine predictions from diverse sources.
PMCID: PMC3053590  PMID: 21118509
16.  Detecting microRNA activity from gene expression data 
BMC Bioinformatics  2010;11:257.
MicroRNAs (miRNAs) are non-coding RNAs that regulate gene expression by binding to the messenger RNA (mRNA) of protein coding genes. They control gene expression by either inhibiting translation or inducing mRNA degradation. A number of computational techniques have been developed to identify the targets of miRNAs. In this study we used predicted miRNA-gene interactions to analyse mRNA gene expression microarray data to predict miRNAs associated with particular diseases or conditions.
Here we combine correspondence analysis, between group analysis and co-inertia analysis (CIA) to determine which miRNAs are associated with differences in gene expression levels in microarray data sets. Using a database of miRNA target predictions from TargetScan, TargetScanS, PicTar4way, PicTar5way, and miRanda and combining these data with gene expression levels from sets of microarrays, this method produces a ranked list of miRNAs associated with a specified split in samples. We applied this to three different microarray datasets: a papillary thyroid carcinoma dataset, an in-house dataset of lipopolysaccharide treated mouse macrophages, and a multi-tissue dataset. In each case we were able to identify miRNAs of biological importance.
We describe a technique to integrate gene expression data and miRNA target predictions from multiple sources.
PMCID: PMC2885376  PMID: 20482775
17.  Sequence embedding for fast construction of guide trees for multiple sequence alignment 
The most widely used multiple sequence alignment methods require sequences to be clustered as an initial step. Most sequence clustering methods require a full distance matrix to be computed between all pairs of sequences. This requires memory and time proportional to N² for N sequences. When N grows larger than 10,000 or so, this becomes increasingly prohibitive and can form a significant barrier to carrying out very large multiple alignments.
In this paper, we have tested variations on a class of embedding methods that have been designed for clustering large numbers of complex objects where the individual distance calculations are expensive. These methods involve embedding the sequences in a space where the similarities within a set of sequences can be closely approximated without having to compute all pair-wise distances.
We show how this approach greatly reduces computation time and memory requirements for clustering large numbers of sequences and demonstrate the quality of the clusterings by benchmarking them as guide trees for multiple alignment. Source code is available for download from
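The general idea of embedding sequences so that not all pair-wise distances need to be computed can be sketched as follows. This is a minimal illustration under stated assumptions, not the method from the paper: each sequence is represented by its distances to a small set of seed sequences, so the cost scales with N times the number of seeds rather than with N². The k-mer Jaccard distance, the seed choice and all function names here are hypothetical stand-ins.

```python
def kmers(seq: str, k: int = 3) -> set[str]:
    """Set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(a: str, b: str, k: int = 3) -> float:
    """Assumed alignment-free distance: 1 minus the Jaccard similarity of k-mer sets."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka and not kb:
        return 0.0
    return 1.0 - len(ka & kb) / len(ka | kb)


def embed(sequences: list[str], seeds: list[str], k: int = 3) -> list[list[float]]:
    """Map each sequence to a vector of its distances to the seed sequences.

    Only len(sequences) * len(seeds) distances are computed, instead of all pairs.
    """
    return [[kmer_distance(seq, seed, k) for seed in seeds] for seq in sequences]


if __name__ == "__main__":
    # Toy protein-like strings, purely illustrative.
    seqs = ["MKVLAAGIVG", "MKVLSAGIVG", "MRTLAAGLVG", "GGHHPPQQWW", "GGHHPPQQWY"]
    seeds = seqs[::2]  # hypothetical seed choice: every second sequence
    for seq, vec in zip(seqs, embed(seqs, seeds)):
        print(seq, [round(x, 2) for x in vec])
```

The resulting vectors could then be clustered with any standard vector-space method to obtain a guide tree; which clustering method and which seed selection strategy work best is a separate question that this sketch does not address.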
PMCID: PMC2893182  PMID: 20470396
18.  Integrating multiple genome annotation databases improves the interpretation of microarray gene expression data 
BMC Genomics  2010;11:50.
The Affymetrix GeneChip is a widely used gene expression profiling platform. Since the chips were originally designed, the genome databases and gene definitions have been considerably updated. Thus, more accurate interpretation of microarray data requires parallel updating of the specificity of GeneChip probes. We propose a new probe remapping protocol, using the zebrafish GeneChips as an example, by removing nonspecific probes, and grouping the probes into transcript level probe sets using an integrated zebrafish genome annotation. This genome annotation is based on combining transcript information from multiple databases. This new remapping protocol, especially the new genome annotation, is shown here to be an important factor in improving the interpretation of gene expression microarray data.
Transcript data from the RefSeq, GenBank and Ensembl databases were downloaded from the UCSC genome browser, and integrated to generate a combined zebrafish genome annotation. Affymetrix probes were filtered and remapped according to the new annotation. The influence of transcript collection and gene definition methods was tested using two microarray data sets. Compared to remapping using a single database, this new remapping protocol results in up to 20% more probes being retained in the remapping, leading to approximately 1,000 more genes being detected. The differentially expressed gene lists are consequently increased by up to 30%. We are also able to detect up to three times more alternative splicing events. A small number of the bioinformatics predictions were confirmed using real-time PCR validation.
By combining gene definitions from multiple databases, it is possible to greatly increase the numbers of genes and splice variants that can be detected in microarray gene expression experiments.
PMCID: PMC2827411  PMID: 20089164
19.  Widespread Dysregulation of MiRNAs by MYCN Amplification and Chromosomal Imbalances in Neuroblastoma: Association of miRNA Expression with Survival 
PLoS ONE  2009;4(11):e7850.
MiRNAs regulate gene expression at a post-transcriptional level and their dysregulation can play major roles in the pathogenesis of many different forms of cancer, including neuroblastoma, an often fatal paediatric cancer originating from precursor cells of the sympathetic nervous system. We have analyzed a set of neuroblastoma (n = 145) that is broadly representative of the genetic subtypes of this disease for miRNA expression (430 loci by stem-loop RT qPCR) and for DNA copy number alterations (array CGH) to assess miRNA involvement in disease pathogenesis. The tumors were stratified and then randomly split into a training set (n = 96) and a validation set (n = 49) for data analysis. Thirty-seven miRNAs were significantly over- or under-expressed in MYCN amplified tumors relative to MYCN single copy tumors, indicating a potential role for the MYCN transcription factor in either the direct or indirect dysregulation of these loci. In addition, we determined that there was a highly significant correlation between miRNA expression levels and DNA copy number, indicating a role for large-scale genomic imbalances in the dysregulation of miRNA expression. In order to directly assess whether miRNA expression was predictive of clinical outcome, we used the Random Forest classifier to identify miRNAs that were most significantly associated with poor overall patient survival and developed a 15 miRNA signature that was predictive of overall survival with 72.7% sensitivity and 86.5% specificity in the validation set of tumors. We conclude that there is widespread dysregulation of miRNA expression in neuroblastoma tumors caused by both over-expression of the MYCN transcription factor and by large-scale chromosomal imbalances. MiRNA expression patterns are also predictive of clinical outcome, highlighting the potential for miRNA mediated diagnostics and therapeutics.
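For readers unfamiliar with the sensitivity and specificity figures quoted for the validation set, the sketch below shows how these two quantities are computed from true and predicted labels. The labels in the example are invented toy values and do not reproduce the study's data; the function name is hypothetical.

```python
def sensitivity_specificity(actual: list[bool], predicted: list[bool]) -> tuple[float, float]:
    """Return (sensitivity, specificity) for binary labels.

    sensitivity = TP / (TP + FN), specificity = TN / (TN + FP).
    """
    if len(actual) != len(predicted):
        raise ValueError("label lists must have the same length")
    tp = sum(a and p for a, p in zip(actual, predicted))
    fn = sum(a and not p for a, p in zip(actual, predicted))
    tn = sum(not a and not p for a, p in zip(actual, predicted))
    fp = sum(not a and p for a, p in zip(actual, predicted))
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    # Toy example: True marks a poor-outcome case (illustrative labels only).
    actual = [True, True, True, False, False, False, False]
    predicted = [True, True, False, False, False, False, True]
    sens, spec = sensitivity_specificity(actual, predicted)
    print(f"sensitivity={sens:.3f} specificity={spec:.3f}")
```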
PMCID: PMC2773120  PMID: 19924232
20.  High DNA melting temperature predicts transcription start site location in human and mouse 
Nucleic Acids Research  2009;37(22):7360-7367.
The accurate computational prediction of transcription start sites (TSS) in vertebrate genomes is a difficult problem. The physicochemical properties of DNA can be computed in various ways and many combinations of DNA features have been tested in the past for use as predictors of transcription. We looked in detail at melting temperature, which measures the temperature at which two strands of DNA separate, considering the cooperative nature of this process. We find that peaks in melting temperature correspond closely to experimentally determined transcription start sites in human and mouse chromosomes. Using melting temperature alone, and with simple thresholding, we can predict TSS with accuracy that is competitive with the most accurate state-of-the-art TSS prediction methods. Accuracy is measured using both experimentally and manually determined TSS. The method works especially well with CpG island containing promoters, but also works when CpG islands are absent. This result is clear evidence of the important role of the physical properties of DNA in the process of transcription. It also points to the importance for TSS prediction methods to include melting temperature as prior information.
PMCID: PMC2794178  PMID: 19820114
21.  Correlation between Biofilm Formation and the Hypoxic Response in Candida parapsilosis▿ †  
Eukaryotic Cell  2009;8(4):550-559.
The ability of Candida parapsilosis to form biofilms on indwelling medical devices is correlated with virulence. To identify genes that are important for biofilm formation, we used arrays representing approximately 4,000 open reading frames (ORFs) to compare the transcriptional profile of biofilm cells growing in a microfermentor under continuous flow conditions with that of cells in planktonic culture. The expression of genes involved in fatty acid and ergosterol metabolism and in glycolysis is upregulated in biofilms. The transcriptional profile of C. parapsilosis biofilm cells resembles that of Candida albicans cells grown under hypoxic conditions. We therefore subsequently used whole-genome arrays (representing 5,900 ORFs) to determine the hypoxic response of C. parapsilosis and showed that the levels of expression of genes involved in the ergosterol and glycolytic pathways, together with several cell wall genes, are increased. Our results indicate that there is substantial overlap between the hypoxic responses of C. parapsilosis and C. albicans and that this may be important for biofilm development. Knocking out an ortholog of the cell wall gene RBT1, whose expression is induced both in biofilms and under conditions of hypoxia in C. parapsilosis, reduces biofilm development.
PMCID: PMC2669199  PMID: 19151323
22.  Hypoxia Selectively Activates the CREB Family of Transcription Factors in the In Vivo Lung 
Rationale: Pulmonary hypertension is a common complication of chronic hypoxic lung diseases and is associated with increased morbidity and reduced survival. The pulmonary vascular changes in response to hypoxia, both structural and functional, are unique to this circulation.
Objectives: To identify transcription factor pathways uniquely activated in the lung in response to hypoxia.
Methods: After exposure to environmental hypoxia (10% O2) for varying periods (3 h to 2 wk), lungs and systemic organs were isolated from groups of adult male mice. Bioinformatic examination of genes the expression of which changed in the hypoxic lung (assessed using microarray analysis) identified potential lung-selective transcription factors controlling these changes in gene expression. In separate further experiments, lung-selective activation of these candidate transcription factors was tested in hypoxic mice and by comparing hypoxic responses of primary human pulmonary and cardiac microvascular endothelial cells in vitro.
Measurements and Main Results: Bioinformatic analysis identified cAMP response element binding (CREB) family members as candidate lung-selective hypoxia-responsive transcription factors. Further in vivo experiments demonstrated activation of CREB and activating transcription factor (ATF)1 and up-regulation of CREB family–responsive genes in the hypoxic lung, but not in other organs. Hypoxia-dependent CREB activation and CREB-responsive gene expression were observed in human primary lung, but not cardiac, microvascular endothelial cells.
Conclusions: These findings suggest that activation of CREB and ATF1 plays a key role in the lung-specific responses to hypoxia, and that lung microvascular endothelial cells are important, proximal effector cells in the specific responses of the pulmonary circulation to hypoxia.
PMCID: PMC2643223  PMID: 18689465
hypoxia; cAMP response element binding; pulmonary hypertension; transcription factor binding site
23.  R-Coffee: a web server for accurately aligning noncoding RNA sequences 
Nucleic Acids Research  2008;36(Web Server issue):W10-W13.
The R-Coffee web server produces highly accurate multiple alignments of noncoding RNA (ncRNA) sequences, taking into account predicted secondary structures. R-Coffee uses a novel algorithm recently incorporated in the T-Coffee package. R-Coffee works along the same lines as T-Coffee: it uses pairwise or multiple sequence alignment (MSA) methods to compute a primary library of input alignments. The program then computes an MSA highly consistent with both the alignments contained in the library and the secondary structures associated with the sequences. The secondary structures are predicted using RNAplfold. The server provides two modes. The slow/accurate mode is restricted to small datasets (fewer than 5 sequences, each shorter than 150 nucleotides) and combines R-Coffee with Consan, a very accurate pairwise RNA alignment method. For larger datasets a fast method can be used (RM-Coffee mode), which uses R-Coffee to combine the outputs of the three programs found to perform best on RNA (MUSCLE, MAFFT and ProbConsRNA). Our BRAliBase benchmarks indicate that the R-Coffee/Consan combination is one of the best ncRNA alignment methods for short sequences, while RM-Coffee gives comparable results on longer sequences. The R-Coffee web server is available at
PMCID: PMC2447777  PMID: 18483080
24.  R-Coffee: a method for multiple alignment of non-coding RNA 
Nucleic Acids Research  2008;36(9):e52.
R-Coffee is a multiple RNA alignment package, derived from T-Coffee, designed to align RNA sequences while exploiting secondary structure information. R-Coffee uses an alignment-scoring scheme that incorporates secondary structure information within the alignment. It works particularly well as an alignment improver and can be combined with any existing sequence alignment method. In this work, we used R-Coffee to compute multiple sequence alignments combining the pairwise output of sequence aligners and structural aligners. We show that R-Coffee can improve the accuracy of all the sequence aligners. We also show that the consistency-based component of T-Coffee can improve the accuracy of several structural aligners. R-Coffee was tested on 388 BRAliBase reference datasets and on 11 longer Cmfinder datasets. Altogether our results suggest that the best protocol for aligning short sequences (less than 200 nt) is the combination of R-Coffee with the RNA pairwise structural aligner Consan. We also show that the simultaneous combination of the four best sequence alignment programs with R-Coffee produces alignments almost as accurate as those obtained with R-Coffee/Consan. Finally, we show that R-Coffee can also be used to align longer datasets beyond the usual scope of structural aligners. R-Coffee is freely available for download, along with documentation, from the T-Coffee web site (
PMCID: PMC2396437  PMID: 18420654
25.  The M-Coffee web server: a meta-method for computing multiple sequence alignments by combining alternative alignment methods 
Nucleic Acids Research  2007;35(Web Server issue):W645-W648.
The M-Coffee server is a web server that makes it possible to compute multiple sequence alignments (MSAs) by running several MSA methods and combining their output into one single model. This allows users to simultaneously run all their methods of choice without having to arbitrarily choose one of them. The MSA is delivered along with a local estimation of its consistency with the individual MSAs it was derived from. The computation of the consensus multiple alignment is carried out using a special mode of the T-Coffee package [Notredame, Higgins and Heringa (T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 2000; 302: 205–217); Wallace, O'Sullivan, Higgins and Notredame (M-Coffee: combining multiple sequence alignment methods with T-Coffee. Nucleic Acids Res. 2006; 34: 1692–1699)]. Given a set of sequences (DNA or proteins) in FASTA format, M-Coffee delivers a multiple alignment in the most common formats. M-Coffee is a freeware open source package distributed under a GPL license and it is available either as a standalone package or as a web service from
PMCID: PMC1933118  PMID: 17526519
