The mammalian sense of smell is governed by the largest gene family, which encodes the olfactory receptors (ORs). The gain and loss of OR genes is typically correlated with adaptations to various ecological niches. Modern humans have 853 OR genes but 55% of these have lost their function. Here we show evidence of additional OR loss of function in the Neanderthal and Denisovan hominin genomes using comparative genomic methodologies. Ten Neanderthal and 8 Denisovan ORs show evidence of loss of function that differ from the reference modern human OR genome. Some of these losses are also present in a subset of modern humans, while some are unique to each lineage. Morphological changes in the cranium of Neanderthals suggest different sensory arrangements to that of modern humans. We identify differences in functional olfactory receptor genes among modern humans, Neanderthals and Denisovans, suggesting varied loss of function across all three taxa and we highlight the utility of using genomic information to elucidate the sensory niches of extinct species.
We describe the development of GWIPS-viz (http://gwips.ucc.ie), an online genome browser for viewing ribosome profiling data. Ribosome profiling (ribo-seq) is a recently developed technique that provides genome-wide information on protein synthesis (GWIPS) in vivo. It is based on the deep sequencing of ribosome-protected messenger RNA (mRNA) fragments, which allows the ribosome density along all mRNA transcripts present in the cell to be quantified. Since its inception, ribo-seq has been carried out in a number of eukaryotic and prokaryotic organisms. Owing to the increasing interest in ribo-seq, there is a pertinent demand for a dedicated ribo-seq genome browser. GWIPS-viz is based on The University of California Santa Cruz (UCSC) Genome Browser. Ribo-seq tracks, coupled with mRNA-seq tracks, are currently available for several genomes: human, mouse, zebrafish, nematode, yeast, bacteria (Escherichia coli K12, Bacillus subtilis), human cytomegalovirus and bacteriophage lambda. Our objective is to continue incorporating published ribo-seq data sets so that the wider community can readily view ribosome profiling information from multiple studies without the need to carry out computational processing.
Cryptococcus neoformans is a human fungal pathogen that is the causative agent of cryptococcosis and fatal meningitis in immuno-compromised hosts. Recent studies suggest that copper (Cu) acquisition plays an important role in C. neoformans virulence, as mutants that lack Cuf1, which activates the Ctr4 high affinity Cu importer, are hypo-virulent in mouse models. To understand the constellation of Cu-responsive genes in C. neoformans and how their expression might contribute to virulence, we determined the transcript profile of C. neoformans in response to elevated Cu or Cu deficiency. We identified two metallothionein genes (CMT1 and CMT2), encoding cysteine-rich Cu binding and detoxifying proteins, whose expression is dramatically elevated in response to excess Cu. We identified a new C. neoformans Cu transporter, CnCtr1, that is induced by Cu deficiency, is distinct from CnCtr4, and shows a significant phylogenetic relationship to Ctr1 from other fungi. Surprisingly, in contrast to other fungi, we found that induction of CnCTR1 and CnCTR4 expression under Cu limitation, and CMT1 and CMT2 in response to Cu excess, are dependent on the CnCuf1 Cu metalloregulatory transcription factor. These studies set the stage for the evaluation of the specific Cuf1 target genes required for virulence in C. neoformans.
Nine subgenotypes from genotype B have been identified for hepatitis B virus (HBV). However, these subgenotypes were less conclusive as they were often designated based on a few representative strains. In addition, subgenotype B6 was designated twice for viruses of different origin.
All complete genome sequences of genotype B HBV were phylogenetically analyzed. Sequence divergences between different potential subgenotypes were also assessed.
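The divergence assessment described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: it assumes the sequences are already aligned, uses the uncorrected p-distance as the divergence measure, and averages over all between-group pairs. The function names and the 0.04 default (taken from the 4% figure quoted in this abstract) are illustrative choices.

```python
def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences.

    Columns containing a gap or an ambiguous base in either sequence
    are skipped (an assumption made for this sketch).
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = diffs = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "ACGT" and y in "ACGT":
            compared += 1
            if x != y:
                diffs += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return diffs / compared


def mean_between_group_divergence(group_a, group_b):
    """Mean p-distance over every pair drawn from two candidate groups."""
    dists = [p_distance(a, b) for a in group_a for b in group_b]
    return sum(dists) / len(dists)


def exceeds_threshold(group_a, group_b, threshold=0.04):
    """True when mean between-group divergence is above the threshold."""
    return mean_between_group_divergence(group_a, group_b) > threshold
```

In practice a model-corrected distance and full-length genomes would be used; the sketch only shows how a divergence threshold separates candidate groups.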
Both phylogenetic and sequence divergence analyses supported the designation of subgenotypes B1, B2, B4, and B6 (from Arctic). However, sequence divergences between previously designated B3, B5, B7, B8, B9 and another B6 (from China) were mostly less than 4%. In addition, subgenotype B3 did not form a monophyletic group.
Current evidence failed to classify original B5, B7, B8, B9, and B6 (from China) as subgenotypes. Instead, they could be considered as a quasi-subgenotype B3 of Southeast Asian and Chinese origin. In addition, previously designated B6 (from Arctic) should be renamed as B5 for continuous numbering. This novel classification is well supported by both the phylogeny and sequence divergence of > 4%.
Hepatitis B virus; Subgenotype; Phylogenetic analysis; Sequence divergence
Candida parapsilosis is one of the most common causes of Candida infection worldwide. However, the genome sequence annotation was made without experimental validation and little is known about the transcriptional landscape. The transcriptional response of C. parapsilosis to hypoxic (low oxygen) conditions, such as those encountered in the host, is also relatively unexplored.
We used next generation sequencing (RNA-seq) to determine the transcriptional profile of C. parapsilosis growing in several conditions including different media, temperatures and oxygen concentrations. We identified 395 novel protein-coding sequences that had not previously been annotated. We removed > 300 unsupported gene models, and corrected approximately 900. We mapped the 5' and 3' UTR for thousands of genes. We also identified 422 introns, including two introns in the 3' UTR of one gene. This is the first report of 3' UTR introns in the Saccharomycotina. Comparing the introns in coding sequences with other species shows that small numbers have been gained and lost throughout evolution. Our analysis also identified a number of novel transcriptionally active regions (nTARs). We used both RNA-seq and microarray analysis to determine the transcriptional profile of cells grown in normoxic and hypoxic conditions in rich media, and we showed that there was a high correlation between the approaches. We also generated a knockout of the UPC2 transcriptional regulator, and we found that, similar to C. albicans, Upc2 is required for conferring resistance to azole drugs, and for regulation of expression of the ergosterol pathway in hypoxia.
We provide the first detailed annotation of the C. parapsilosis genome, based on gene predictions and transcriptional analysis. We identified a number of novel ORFs and other transcribed regions, and detected transcripts from approximately 90% of the annotated protein coding genes. We found that the transcription factor Upc2 has a conserved role as a major regulator of the hypoxic response in C. parapsilosis and C. albicans.
Transcriptional profiling, pathogenesis, RNA-seq, Candida
Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega
Multiple sequence alignments are fundamental to many sequence analysis methods. The new program Clustal Omega can align virtually any number of protein sequences quickly and has powerful features for adding sequences to existing precomputed alignments.
Multiple sequence alignments are fundamental to many sequence analysis methods. Most alignments are computed using the progressive alignment heuristic. These methods are starting to become a bottleneck in some analysis pipelines when faced with data sets of the size of many thousands of sequences. Some methods allow computation of larger data sets while sacrificing quality, and others produce high-quality alignments, but scale badly with the number of sequences. In this paper, we describe a new program called Clustal Omega, which can align virtually any number of protein sequences quickly and that delivers accurate alignments. The accuracy of the package on smaller test cases is similar to that of the high-quality aligners. On larger data sets, Clustal Omega outperforms other packages in terms of execution time and quality. Clustal Omega also has powerful features for adding sequences to and exploiting information in existing alignments, making use of the vast amount of precomputed information in public databases like Pfam.
bioinformatics; hidden Markov models; multiple sequence alignment
The regulation of the response of Candida albicans to hypoxic (low-oxygen) conditions is poorly understood. We used microarray and other transcriptional analyses to investigate the role of the Upc2 and Bcr1 transcription factors in controlling expression of genes involved in cell wall metabolism, ergosterol synthesis, and glycolysis during adaptation to hypoxia. Hypoxic induction of the ergosterol pathway is mimicked by treatment with sterol-lowering drugs (ketoconazole) and requires UPC2. Expression of three members of the family CFEM (common in several fungal extracellular membranes) of cell wall genes (RBT5, PGA7, and PGA10) is also induced by hypoxia and ketoconazole and requires both UPC2 and BCR1. Expression of glycolytic genes is induced by hypoxia but not by treatment with sterol-lowering drugs, whereas expression of respiratory pathway genes is repressed. However, Upc2 does not play a major role in regulating expression of genes required for central carbon metabolism. Our results indicate that regulation of gene expression in response to hypoxia in C. albicans is complex and is signaled both via lowered sterol levels and other unstudied mechanisms. We also show that induction of filamentation under hypoxic conditions requires the Ras1- and Cdc35-dependent pathway.
More and more nucleotide sequences of type A influenza virus are available in public databases. Although these sequences have been the focus of many molecular epidemiological and phylogenetic analyses, most studies only deal with a few representative sequences. In this paper, we present a complete analysis of all Haemagglutinin (HA) and Neuraminidase (NA) gene sequences available to allow large scale analyses of the evolution and epidemiology of type A influenza.
This paper describes an analysis and complete classification of all HA and NA gene sequences available in public databases using multivariate and phylogenetic methods.
We analyzed 18975 HA sequences and divided them into 280 subgroups according to multivariate and phylogenetic analyses. Similarly, we divided 11362 NA sequences into 202 subgroups. Compared to previous analyses, this work is more detailed and comprehensive, especially for the bigger datasets. Therefore, it can be used to show the full and complex phylogenetic diversity and provides a framework for studying the molecular evolution and epidemiology of type A influenza virus. More than 85% of type A influenza HA and NA sequences in GenBank are categorized in one unambiguous and unique group. Therefore, our results are a kind of genetic and phylogenetic annotation for influenza HA and NA sequences. In addition, sequences of swine influenza viruses come from 56 HA and 45 NA subgroups. Most of these subgroups also include viruses from other hosts, indicating cross species transmission of the viruses between pigs and other hosts. Furthermore, the phylogenetic diversity of swine influenza viruses from Eurasia is greater than that of North American strains and both of them are becoming more diverse. Apart from viruses from humans, pigs, birds and horses, viruses from other species show very low phylogenetic diversity. This might indicate that viruses have not become established in these species. Based on current evidence, there is no simple pattern of inter-hemisphere transmission of avian influenza viruses and it appears to happen sporadically. However, for H6 subtype avian influenza viruses, such transmissions might have happened very frequently and multiple and bidirectional transmission events might exist.
The computational prediction of transcription start sites is an important unsolved problem. Some recent progress has been made, but many promoters, particularly those not associated with CpG islands, are still difficult to locate using current methods. These methods use different features and training sets, along with a variety of machine learning techniques and result in different prediction sets.
We demonstrate the heterogeneity of current prediction sets, and take advantage of this heterogeneity to construct a two-level classifier ('Profisi Ensemble') using predictions from 7 programs, along with 2 other data sources. Support vector machines using 'full' and 'reduced' data sets are combined in an either/or approach. We achieve a 14% increase in performance over the current state-of-the-art, as benchmarked by a third-party tool.
Supervised learning methods are a useful way to combine predictions from diverse sources.
MicroRNAs (miRNAs) are non-coding RNAs that regulate gene expression by binding to the messenger RNA (mRNA) of protein coding genes. They control gene expression by either inhibiting translation or inducing mRNA degradation. A number of computational techniques have been developed to identify the targets of miRNAs. In this study we used predicted miRNA-gene interactions to analyse mRNA gene expression microarray data to predict miRNAs associated with particular diseases or conditions.
Here we combine correspondence analysis, between group analysis and co-inertia analysis (CIA) to determine which miRNAs are associated with differences in gene expression levels in microarray data sets. Using a database of miRNA target predictions from TargetScan, TargetScanS, PicTar4way, PicTar5way, and miRanda and combining these data with gene expression levels from sets of microarrays, this method produces a ranked list of miRNAs associated with a specified split in samples. We applied this to three different microarray datasets: a papillary thyroid carcinoma dataset, an in-house dataset of lipopolysaccharide treated mouse macrophages, and a multi-tissue dataset. In each case we were able to identify miRNAs of biological importance.
We describe a technique to integrate gene expression data and miRNA target predictions from multiple sources.
The most widely used multiple sequence alignment methods require sequences to be clustered as an initial step. Most sequence clustering methods require a full distance matrix to be computed between all pairs of sequences. This requires memory and time proportional to N² for N sequences. When N grows larger than 10,000 or so, this becomes increasingly prohibitive and can form a significant barrier to carrying out very large multiple alignments.
In this paper, we have tested variations on a class of embedding methods that have been designed for clustering large numbers of complex objects where the individual distance calculations are expensive. These methods involve embedding the sequences in a space where the similarities within a set of sequences can be closely approximated without having to compute all pair-wise distances.
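The embedding idea can be illustrated with a short sketch. It is a simplification under stated assumptions, not the published algorithm: each sequence is represented by its distances to a small set of seed sequences, so only N × t distances are computed instead of N × N. The k-mer Jaccard distance, the seed selection rule (evenly spaced picks from a length-sorted list) and all function names are hypothetical choices for illustration.

```python
def kmer_set(seq, k=3):
    """All overlapping k-mers of a sequence, as a set."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(a, b, k=3):
    """Jaccard distance between the k-mer sets of two sequences."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    union = ka | kb
    if not union:
        return 0.0
    return 1.0 - len(ka & kb) / len(union)


def embed(seqs, n_seeds, k=3):
    """Map each sequence to a vector of distances to n_seeds seed sequences.

    Seeds are picked at even intervals from the length-sorted input
    (an assumption for this sketch). Cost is N * n_seeds distance
    computations rather than N * N.
    """
    order = sorted(range(len(seqs)), key=lambda i: len(seqs[i]))
    step = max(1, len(seqs) // n_seeds)
    seeds = [seqs[i] for i in order[::step]][:n_seeds]
    return [[kmer_distance(s, seed, k) for seed in seeds] for s in seqs]


def euclidean(u, v):
    """Euclidean distance between two embedded vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v)) ** 0.5


# Toy input: two pairs of closely related peptides.
toy = ["MKVLAAGIVG", "MKVLAAGIVA", "PQRSTWYHHC", "PQRSTWYHHD"]
vectors = embed(toy, n_seeds=2)
```

The embedded vectors can then be handed to any standard vector clustering method (k-means, for example) to build a guide tree without a full distance matrix.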
We show how this approach greatly reduces computation time and memory requirements for clustering large numbers of sequences and demonstrate the quality of the clusterings by benchmarking them as guide trees for multiple alignment. Source code is available for download from http://www.clustal.org/mbed.tgz.
The Affymetrix GeneChip is a widely used gene expression profiling platform. Since the chips were originally designed, the genome databases and gene definitions have been considerably updated. Thus, more accurate interpretation of microarray data requires parallel updating of the specificity of GeneChip probes. We propose a new probe remapping protocol, using the zebrafish GeneChips as an example, by removing nonspecific probes, and grouping the probes into transcript level probe sets using an integrated zebrafish genome annotation. This genome annotation is based on combining transcript information from multiple databases. This new remapping protocol, especially the new genome annotation, is shown here to be an important factor in improving the interpretation of gene expression microarray data.
Transcript data from the RefSeq, GenBank and Ensembl databases were downloaded from the UCSC genome browser, and integrated to generate a combined zebrafish genome annotation. Affymetrix probes were filtered and remapped according to the new annotation. The influence of transcript collection and gene definition methods was tested using two microarray data sets. Compared to remapping using a single database, this new remapping protocol results in up to 20% more probes being retained in the remapping, leading to approximately 1,000 more genes being detected. The differentially expressed gene lists are consequently increased by up to 30%. We are also able to detect up to three times more alternative splicing events. A small number of the bioinformatics predictions were confirmed using real-time PCR validation.
By combining gene definitions from multiple databases, it is possible to greatly increase the numbers of genes and splice variants that can be detected in microarray gene expression experiments.
The accurate computational prediction of transcription start sites (TSS) in vertebrate genomes is a difficult problem. The physicochemical properties of DNA can be computed in various ways and many combinations of DNA features have been tested in the past for use as predictors of transcription. We looked in detail at melting temperature, which measures the temperature at which two strands of DNA separate, considering the cooperative nature of this process. We find that peaks in melting temperature correspond closely to experimentally determined transcription start sites in human and mouse chromosomes. Using melting temperature alone, and with simple thresholding, we can predict TSS with accuracy that is competitive with the most accurate state-of-the-art TSS prediction methods. Accuracy is measured using both experimentally and manually determined TSS. The method works especially well with CpG island containing promoters, but also works when CpG islands are absent. This result is clear evidence of the important role of the physical properties of DNA in the process of transcription. It also points to the importance for TSS prediction methods to include melting temperature as prior information.
The ability of Candida parapsilosis to form biofilms on indwelling medical devices is correlated with virulence. To identify genes that are important for biofilm formation, we used arrays representing approximately 4,000 open reading frames (ORFs) to compare the transcriptional profile of biofilm cells growing in a microfermentor under continuous flow conditions with that of cells in planktonic culture. The expression of genes involved in fatty acid and ergosterol metabolism and in glycolysis, is upregulated in biofilms. The transcriptional profile of C. parapsilosis biofilm cells resembles that of Candida albicans cells grown under hypoxic conditions. We therefore subsequently used whole-genome arrays (representing 5,900 ORFs) to determine the hypoxic response of C. parapsilosis and showed that the levels of expression of genes involved in the ergosterol and glycolytic pathways, together with several cell wall genes, are increased. Our results indicate that there is substantial overlap between the hypoxic responses of C. parapsilosis and C. albicans and that this may be important for biofilm development. Knocking out an ortholog of the cell wall gene RBT1, whose expression is induced both in biofilms and under conditions of hypoxia in C. parapsilosis, reduces biofilm development.
Rationale: Pulmonary hypertension is a common complication of chronic hypoxic lung diseases and is associated with increased morbidity and reduced survival. The pulmonary vascular changes in response to hypoxia, both structural and functional, are unique to this circulation.
Objectives: To identify transcription factor pathways uniquely activated in the lung in response to hypoxia.
Methods: After exposure to environmental hypoxia (10% O2) for varying periods (3 h to 2 wk), lungs and systemic organs were isolated from groups of adult male mice. Bioinformatic examination of genes the expression of which changed in the hypoxic lung (assessed using microarray analysis) identified potential lung-selective transcription factors controlling these changes in gene expression. In separate further experiments, lung-selective activation of these candidate transcription factors was tested in hypoxic mice and by comparing hypoxic responses of primary human pulmonary and cardiac microvascular endothelial cells in vitro.
Measurements and Main Results: Bioinformatic analysis identified cAMP response element binding (CREB) family members as candidate lung-selective hypoxia-responsive transcription factors. Further in vivo experiments demonstrated activation of CREB and activating transcription factor (ATF)1 and up-regulation of CREB family–responsive genes in the hypoxic lung, but not in other organs. Hypoxia-dependent CREB activation and CREB-responsive gene expression were observed in human primary lung, but not cardiac, microvascular endothelial cells.
Conclusions: These findings suggest that activation of CREB and ATF1 plays a key role in the lung-specific responses to hypoxia, and that lung microvascular endothelial cells are important, proximal effector cells in the specific responses of the pulmonary circulation to hypoxia.
hypoxia; cAMP response element binding; pulmonary hypertension; transcription factor binding site
The R-Coffee web server produces highly accurate multiple alignments of noncoding RNA (ncRNA) sequences, taking into account predicted secondary structures. R-Coffee uses a novel algorithm recently incorporated in the T-Coffee package. R-Coffee works along the same lines as T-Coffee: it uses pairwise or multiple sequence alignment (MSA) methods to compute a primary library of input alignments. The program then computes an MSA highly consistent with both the alignments contained in the library and the secondary structures associated with the sequences. The secondary structures are predicted using RNAplfold. The server provides two modes. The slow/accurate mode is restricted to small datasets (fewer than 5 sequences of less than 150 nucleotides) and combines R-Coffee with Consan, a very accurate pairwise RNA alignment method. For larger datasets a fast method can be used (RM-Coffee mode), which uses R-Coffee to combine the outputs of the three programs found to perform best on RNA (MUSCLE, MAFFT and ProbConsRNA). Our BRAliBase benchmarks indicate that the R-Coffee/Consan combination is one of the best ncRNA alignment methods for short sequences, while RM-Coffee gives comparable results on longer sequences. The R-Coffee web server is available at http://www.tcoffee.org.
R-Coffee is a multiple RNA alignment package, derived from T-Coffee, designed to align RNA sequences while exploiting secondary structure information. R-Coffee uses an alignment-scoring scheme that incorporates secondary structure information within the alignment. It works particularly well as an alignment improver and can be combined with any existing sequence alignment method. In this work, we used R-Coffee to compute multiple sequence alignments combining the pairwise output of sequence aligners and structural aligners. We show that R-Coffee can improve the accuracy of all the sequence aligners. We also show that the consistency-based component of T-Coffee can improve the accuracy of several structural aligners. R-Coffee was tested on 388 BRAliBase reference datasets and on 11 longer Cmfinder datasets. Altogether our results suggest that the best protocol for aligning short sequences (less than 200 nt) is the combination of R-Coffee with the RNA pairwise structural aligner Consan. We also show that the simultaneous combination of the four best sequence alignment programs with R-Coffee produces alignments almost as accurate as those obtained with R-Coffee/Consan. Finally, we show that R-Coffee can also be used to align longer datasets beyond the usual scope of structural aligners. R-Coffee is freely available for download, along with documentation, from the T-Coffee web site (www.tcoffee.org).
The M-Coffee server is a web server that makes it possible to compute multiple sequence alignments (MSAs) by running several MSA methods and combining their output into one single model. This allows users to run all their methods of choice simultaneously without having to arbitrarily choose one of them. The MSA is delivered along with a local estimation of its consistency with the individual MSAs it was derived from. The computation of the consensus multiple alignment is carried out using a special mode of the T-Coffee package [Notredame, Higgins and Heringa (T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 2000; 302: 205–217); Wallace, O'Sullivan, Higgins and Notredame (M-Coffee: combining multiple sequence alignment methods with T-Coffee. Nucleic Acids Res. 2006; 34: 1692–1699)]. Given a set of sequences (DNA or proteins) in FASTA format, M-Coffee delivers a multiple alignment in the most common formats. M-Coffee is a freeware open source package distributed under a GPL license and it is available either as a standalone package or as a web service from www.tcoffee.org.
Proteins that evolve from a common ancestor can change functionality over time, and it is important to be able to identify residues that cause this change. In this paper we show how a supervised multivariate statistical method, Between Group Analysis (BGA), can be used to identify these residues from families of proteins with different substrate specificities using multiple sequence alignments.
We demonstrate the usefulness of this method on three different test cases. Two of these test cases, the Lactate/Malate dehydrogenase family and Nucleotidyl Cyclases, consist of two functional groups. The other family, Serine Proteases, consists of three groups. BGA was used to analyse and visualise these three families using two different encoding schemes for the amino acids.
The overall combination of methods in this paper is powerful and flexible while being computationally very fast and simple. BGA is especially useful because it can be used to analyse any number of functional classes. In the examples in this paper, we have used only 2 or 3 classes for demonstration purposes but any number can be used and visualised.
Numerous feature selection methods have been applied to the identification of differentially expressed genes in microarray data. These include simple fold change, classical t-statistic and moderated t-statistics. Even though these methods return gene lists that are often dissimilar, few direct comparisons of these exist. We present an empirical study in which we compare some of the most commonly used feature selection methods. We apply these to 9 publicly available datasets, and compare both the gene lists produced and how these perform in class prediction of test datasets.
In this study, we compared the efficiency of the following feature selection methods: significance analysis of microarrays (SAM), analysis of variance (ANOVA), empirical Bayes t-statistic, template matching, maxT, between group analysis (BGA), area under the receiver operating characteristic (ROC) curve, the Welch t-statistic, fold change, rank products, and sets of randomly selected genes. In each case these methods were applied to 9 different binary (two class) microarray datasets. Firstly, we found little agreement in gene lists produced by the different methods. Only 8 to 21% of genes were in common across all 10 feature selection methods. Secondly, we evaluated the class prediction efficiency of each gene list in training and test cross-validation using four supervised classifiers.
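To make the gene-list comparison concrete, the following sketch ranks genes by two of the listed methods (fold change and the Welch t-statistic) and measures how much the resulting top-n lists agree. It is an illustration under assumptions, not the study's code: expression values are taken to be on a log scale, the overlap is measured as a Jaccard index, and the toy data and function names are invented for the example.

```python
from math import sqrt
from statistics import mean, variance


def fold_change(a, b):
    """Absolute difference of class means (log-scale data assumed)."""
    return abs(mean(a) - mean(b))


def welch_t(a, b):
    """Absolute Welch t-statistic for two classes with unequal variances."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return abs(mean(a) - mean(b)) / se if se > 0 else 0.0


def top_genes(expr, score, n):
    """Names of the n genes with the highest score.

    expr maps gene name -> (class 1 values, class 2 values).
    """
    ranked = sorted(expr, key=lambda g: score(*expr[g]), reverse=True)
    return set(ranked[:n])


def overlap(x, y):
    """Jaccard index of two gene lists."""
    return len(x & y) / len(x | y)


# Invented toy data: g2 has a large but noisy difference,
# g3 a small but very consistent one.
expr = {
    "g1": ([5.0, 5.1, 4.9], [7.0, 7.1, 6.9]),
    "g2": ([5.0, 8.0, 2.0], [8.0, 11.0, 5.0]),
    "g3": ([6.0, 6.01, 5.99], [6.3, 6.31, 6.29]),
    "g4": ([4.0, 4.5, 3.5], [4.1, 4.4, 3.6]),
}
top_fc = top_genes(expr, fold_change, 2)
top_t = top_genes(expr, welch_t, 2)
agreement = overlap(top_fc, top_t)
```

Fold change favours the noisy gene with the largest mean difference, while the t-statistic favours the low-variance gene, so the two lists only partly agree even on four genes.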
We report that the choice of feature selection method, the number of genes in the gene list, the number of cases (samples) and the noise in the dataset substantially influence classification success. Recommendations are made for choice of feature selection. Area under a ROC curve performed well with datasets that had low levels of noise and large sample size. Rank products performed well when datasets had low numbers of samples or high levels of noise. The empirical Bayes t-statistic performed well across a range of sample sizes.
We introduce M-Coffee, a meta-method for assembling multiple sequence alignments (MSA) by combining the output of several individual methods into one single MSA. M-Coffee is an extension of T-Coffee and uses consistency to estimate a consensus alignment. We show that the procedure is robust to variations in the choice of constituent methods and reasonably tolerant to duplicate MSAs. We also show that performances can be improved by carefully selecting the constituent methods. M-Coffee outperforms all the individual methods on three major reference datasets: HOMSTRAD, Prefab and Balibase. We also show that on a case-by-case basis, M-Coffee is twice as likely to deliver the best alignment than any individual method. Given a collection of pre-computed MSAs, M-Coffee has similar CPU requirements to the original T-Coffee. M-Coffee is a freeware open-source package available from .
One of the main goals of cancer genetics is to identify the causative elements at the molecular level leading to cancer.
We have conducted an analysis of a set of genes known to be involved in cancer in order to unveil their unique features that can assist towards the identification of new candidate cancer genes.
We have detected key patterns in this group of genes in terms of the molecular function or the biological process in which they are involved as well as sequence properties. Based on these features we have developed an accurate Bayesian classification model with which human genes have been scored for their likelihood of involvement in cancer.
Imprinted genes exhibit silencing of one of the parental alleles during embryonic
development. In a previous study imprinted genes were found to have reduced intron
content relative to a non-imprinted control set (Hurst et al., 1996). However, due
to the small sample size, it was not possible to analyse the source of this effect.
Here, we re-investigate this observation using larger datasets of imprinted and control
(non-imprinted) genes that allow us to consider mouse and human, and maternally
and paternally silenced, imprinted genes separately. We find that, in the human and
mouse, there is reduced intron content in the maternally silenced imprinted genes
relative to a non-imprinted control set. Among imprinted genes, a strong bias is
also observed in the distribution of intronless genes, which are found exclusively
in the maternally silenced dataset. The paternally silenced dataset in the human is
not different to the control set; however, the mouse paternally silenced dataset has
more introns than the control group. A direct comparison of mouse maternally and
paternally silenced imprinted gene datasets shows that they differ significantly with
respect to a variety of intron-related parameters. We discuss a variety of possible
explanations for our observations.
Rapid development of DNA microarray technology has resulted in different laboratories adopting numerous different protocols and technological platforms, which has severely impacted the comparability of array data. Current cross-platform comparisons of microarray gene expression data are usually based on cross-referencing the annotation of each gene transcript represented on the arrays, extracting a list of genes common to all arrays and comparing expression data of this gene subset. Unfortunately, filtering of genes to a subset represented across all arrays often excludes many thousands of genes, because different subsets of genes from the genome are represented on different arrays. We describe the application of a powerful yet simple method for cross-platform comparison of gene expression data. Co-inertia analysis (CIA) is a multivariate method that identifies trends or co-relationships in multiple datasets which contain the same samples. CIA simultaneously finds ordinations (dimension reduction diagrams) from the datasets that are most similar. It does this by finding successive axes from the two datasets with maximum covariance. CIA can be applied to datasets where the number of variables (genes) far exceeds the number of samples (arrays), as is the case with microarray analyses.
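The core step, finding a pair of axes with maximum covariance, can be sketched as follows. This is a deliberately reduced illustration and not the full CIA procedure: it assumes uniform sample weights and simple column centering (CIA as usually implemented couples two weighted ordinations), extracts only the first axis pair, and uses power iteration on the cross-covariance matrix. All names and the toy matrices are invented for the example.

```python
def center(m):
    """Subtract each column mean from a samples-by-variables matrix."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[x - mu for x, mu in zip(row, means)] for row in m]


def cross_covariance(x, y):
    """Variables-of-x by variables-of-y covariance of two centered matrices."""
    n = len(x)
    return [[sum(x[s][i] * y[s][j] for s in range(n)) / n
             for j in range(len(y[0]))] for i in range(len(x[0]))]


def normalize(v):
    """Scale a vector to unit length."""
    norm = sum(a * a for a in v) ** 0.5
    if norm == 0:
        raise ValueError("zero vector: datasets share no covariance")
    return [a / norm for a in v]


def first_axes(x, y, iters=200):
    """First pair of axes (u for x, v for y) maximising cov(x.u, y.v)."""
    xc, yc = center(x), center(y)
    c = cross_covariance(xc, yc)
    p, q = len(c), len(c[0])
    u = normalize([1.0 + i for i in range(p)])  # arbitrary fixed start
    for _ in range(iters):
        v = normalize([sum(c[i][j] * u[i] for i in range(p)) for j in range(q)])
        u = normalize([sum(c[i][j] * v[j] for j in range(q)) for i in range(p)])
    sx = [sum(a * b for a, b in zip(row, u)) for row in xc]
    sy = [sum(a * b for a, b in zip(row, v)) for row in yc]
    cov = sum(a * b for a, b in zip(sx, sy)) / len(x)
    return u, v, sx, sy, cov


# Same four samples measured on two hypothetical platforms
# with different numbers of variables.
X = [[1.0, 2.0, 0.5], [1.2, 2.1, 0.4], [3.0, 0.5, 0.6], [3.1, 0.4, 0.5]]
Y = [[10.0, 1.0], [11.0, 1.2], [2.0, 1.1], [1.0, 0.9]]
u, v, scores_x, scores_y, cov = first_axes(X, Y)
```

Each sample receives one score per dataset on the shared axis; the gap between a sample's two scores plays the role of the connecting line in the plots described for the analysis. Note that the two matrices need the same samples in the same order but not the same variables.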
We illustrate the power of CIA for cross-platform analysis of gene expression data by using it to identify the main common relationships in expression profiles on a panel of 60 tumour cell lines from the National Cancer Institute (NCI) which have been subjected to microarray studies using both Affymetrix and spotted cDNA array technology. The co-ordinates of the CIA projections of the cell lines from each dataset are graphed in a bi-plot and are connected by a line, the length of which indicates the divergence between the two datasets. Thus, CIA provides graphical representation of consensus and divergence between the gene expression profiles from different microarray platforms. Secondly, the genes that define the main trends in the analysis can be easily identified.
CIA is a robust, efficient approach to coupling of gene expression datasets. CIA provides simple graphical representations of the results making it a particularly attractive method for the identification of relationships between large datasets.
The Clustal series of programs is widely used in molecular biology for the multiple alignment of both nucleic acid and protein sequences and for preparing phylogenetic trees. The popularity of the programs depends on a number of factors, including not only the accuracy of the results, but also the robustness, portability and user-friendliness of the programs. New features include NEXUS and FASTA format output, printing range numbers and faster tree calculation. Although Clustal was originally developed to run on a local computer, numerous Web servers have been set up, notably at the EBI (European Bioinformatics Institute) (http://www.ebi.ac.uk/clustalw/).