1.  Genomic divergence between nine- and three-spined sticklebacks 
BMC Genomics  2013;14(1):756.
Comparative genomics approaches help to shed light on evolutionary processes that shape differentiation between lineages. The nine-spined stickleback (Pungitius pungitius) is a close relative of the ecological ‘supermodel’ three-spined stickleback (Gasterosteus aculeatus). It is an emerging model system for evolutionary biology research but has garnered less attention and lacks extensive genomic resources. To expand these resources and aid the study of sticklebacks in a phylogenetic framework, we characterized nine-spined stickleback transcriptomes from brain and liver using deep sequencing.
We obtained nearly eight thousand assembled transcripts, of which 3,091 were assigned as putative one-to-one orthologs to genes found in the three-spined stickleback. These sequences were used for evaluating overall differentiation and substitution rates between nine- and three-spined sticklebacks, and to identify genes that are putatively evolving under positive selection. The synonymous substitution rate was estimated to be 7.1 × 10⁻⁹ per site per year between the two species, and a total of 165 genes showed patterns of adaptive evolution in one or both species. A few nine-spined stickleback contigs lacked an obvious ortholog in three-spined sticklebacks but were found to match genes in other fish species, suggesting several gene losses in the 13 million years since the divergence of the two stickleback species. We identified 47 SNPs in 25 different genes that differentiate pond and marine ecotypes. We also identified 468 microsatellites that could be further developed as genetic markers in nine-spined sticklebacks.
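As a rough check on how the reported rate and divergence time relate, the following sketch assumes the common relationship rate = dS / (2T), in which divergence accumulates along both lineages. Whether the abstract's rate is a per-lineage rate is an assumption made here, and the function names are illustrative rather than taken from the study.

```python
# Values quoted in the abstract.
RATE_PER_SITE_PER_YEAR = 7.1e-9
DIVERGENCE_TIME_YEARS = 13e6

def implied_pairwise_ds(rate, divergence_time_years):
    """Pairwise synonymous divergence implied by a per-lineage rate.

    Assumes dS = 2 * rate * T (both lineages accumulate substitutions).
    """
    return 2.0 * rate * divergence_time_years

def rate_from_ds(ds, divergence_time_years):
    """Inverse relationship: per-lineage rate from pairwise dS."""
    return ds / (2.0 * divergence_time_years)

ds = implied_pairwise_ds(RATE_PER_SITE_PER_YEAR, DIVERGENCE_TIME_YEARS)
print(f"implied pairwise dS: {ds:.3f} substitutions per synonymous site")
```

Under these assumptions the implied pairwise synonymous divergence is a little under 0.2 substitutions per site; a different reading of the rate (for example, as a pairwise rate) would halve that figure.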
With deep sequencing of nine-spined stickleback cDNA libraries, our study provides a significant increase in the number of gene sequences and microsatellite markers for this species, and identifies a number of genes showing patterns of adaptive evolution between nine- and three-spined sticklebacks. We also report several candidate genes that might be involved in differential adaptation between marine and freshwater nine-spined sticklebacks. This study provides a valuable resource for future studies aiming to identify candidate genes underlying ecological adaptation in this and other stickleback species.
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2164-14-756) contains supplementary material, which is available to authorized users.
PMCID: PMC4046692  PMID: 24188282
Pungitius pungitius; Gasterosteus aculeatus; Comparative genomics; Transcriptome; Substitution rate; Adaptive evolution
2.  Characterization of the Zoarces viviparus liver transcriptome using massively parallel pyrosequencing 
BMC Genomics  2009;10:345.
The teleost Zoarces viviparus (eelpout) lives along the coasts of Northern Europe and has long been an established model organism for marine ecology and environmental monitoring. The scarcity of information about this species' genome has, however, restricted the use of efficient molecular-level assays, such as gene expression microarrays.
Here we present the first comprehensive characterization of the Zoarces viviparus liver transcriptome. From 400,000 reads generated by massively parallel pyrosequencing, more than 50,000 putative transcript fragments were assembled, annotated and functionally classified. The data were estimated to cover roughly 40% of the total transcriptome, and homologues for about half of the genes of Gasterosteus aculeatus (stickleback) were identified. The sequence data were subsequently used to design an oligonucleotide microarray for large-scale gene expression analysis.
Our results show that one run using a Genome Sequencer FLX from 454 Life Sciences/Roche generates enough sequence information for adequate de novo assembly of a large number of genes in a higher vertebrate. The generated sequence data, including the validated microarray probes, are publicly available to promote genome-wide research in Zoarces viviparus.
PMCID: PMC2725146  PMID: 19646242
3.  Development and evaluation of new mask protocols for gene expression profiling in humans and chimpanzees 
BMC Bioinformatics  2009;10:77.
Cross-species gene expression analyses using oligonucleotide microarrays designed to evaluate a single species can provide spurious results due to mismatches between the interrogated transcriptome and arrayed probes. Based on the most recent human and chimpanzee genome assemblies, we developed updated and accessible probe masking methods that allow human Affymetrix oligonucleotide microarrays to be used for robust genome-wide expression analyses in both species. In this process, only data from oligonucleotide probes predicted to have robust hybridization sensitivity and specificity for both transcriptomes are retained for analysis.
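The masking idea described above can be sketched as follows. This is a minimal illustration and not the authors' protocol: the probe identifiers, the per-species "predicted to hybridize well" sets, and the use of a plain mean as the probe set summary are all assumptions made for the example (real Affymetrix summarization methods are more involved).

```python
from statistics import mean

def build_mask(probes, human_ok, chimp_ok):
    """Return the probe IDs predicted to perform well in BOTH species."""
    return {p for p in probes if p in human_ok and p in chimp_ok}

def masked_probeset_scores(probesets, intensities, mask, min_probes=1):
    """Summarize each probe set from retained probes only.

    Returns {probe_set: (mean_intensity, n_retained_probes)} so that the
    number of surviving probes can be considered when interpreting scores.
    """
    scores = {}
    for name, probe_ids in probesets.items():
        kept = [p for p in probe_ids if p in mask]
        if len(kept) >= min_probes:
            scores[name] = (mean(intensities[p] for p in kept), len(kept))
    return scores

# Hypothetical toy data.
probesets = {"geneA": ["a1", "a2", "a3"], "geneB": ["b1", "b2"]}
intensities = {"a1": 10.0, "a2": 12.0, "a3": 30.0, "b1": 5.0, "b2": 7.0}
human_ok = {"a1", "a2", "a3", "b1", "b2"}
chimp_ok = {"a1", "a2", "b2"}  # a3 and b1 mismatch the chimpanzee sequence

mask = build_mask(intensities, human_ok, chimp_ok)
scores = masked_probeset_scores(probesets, intensities, mask)
print(scores)
```

Returning the retained probe count alongside each score reflects the abstract's point that probe sets reduced to few probes deserve more cautious interpretation.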
To characterize the utility of this resource, we applied our mask protocols to existing expression data from brains, livers, hearts, testes, and kidneys derived from both species and determined the effects probe numbers have on expression scores of specific transcripts. In all five tissues, probe sets with decreasing numbers of probes showed non-linear trends towards increased variation in expression scores. The relationships between expression variation and probe number in brain data closely matched those observed in simulated expression data sets subjected to random probe masking. However, there is evidence that additional factors affect the observed relationships between gene expression scores and probe number in tissues such as liver and kidney. In parallel, we observed that decreasing the number of probes within probe sets led to linear increases in both gained and lost inferences of differential cross-species expression in all five tissues, which will affect the interpretation of expression data subject to masking.
We introduce a readily implemented and updated resource for human and chimpanzee transcriptome analysis through a commonly used microarray platform. Based on empirical observations derived from the analysis of five distinct data sets, we provide novel guidelines for the interpretation of masked data that take the number of probes present in a given probe set into consideration. These guidelines are applicable to other customized applications that involve masking data from specific subsets of probes.
PMCID: PMC2660304  PMID: 19265541
4.  Cross-platform expression microarray performance in a mouse model of mitochondrial disease therapy 
Molecular genetics and metabolism  2009;99(3):309-318.
Microarray expression profiling has become a valuable tool in the evaluation of the genetic consequences of metabolic disease. Although 3′-biased gene expression microarray platforms were the first generation to have widespread availability, newer platforms are gradually emerging that have more up-to-date content and/or higher cost efficiency. Deciphering the relative strengths and weaknesses of these various platforms for metabolic pathway level analyses can be daunting. We sought to determine the practical strengths and weaknesses of four leading commercially-available expression array platforms relative to biologic investigations, as well as assess the feasibility of cross-platform data integration for purposes of biochemical pathway analyses.
Liver RNA from B6.Alb/cre,Pdss2loxP/loxP mice having primary Coenzyme Q deficiency was extracted either at baseline or following treatment with an antioxidant/antihyperlipidemic agent, probucol. Target RNA samples were prepared and hybridized to Affymetrix 430 2.0, Affymetrix Gene 1.0 ST, Affymetrix Exon 1.0 ST, and Illumina Mouse WG-6 expression arrays. Probes on all platforms were re-mapped to coding sequences in the current version of the mouse genome. Data processing and statistical analysis were performed by R/Bioconductor functions, and pathway analyses were carried out by KEGG Atlas and GSEA.
Expression measurements were generally consistent across platforms. However, intensive probe-level comparison suggested that differences in probe locations were a major source of inter-platform variance. In addition, genes expressed at low or intermediate levels had lower inter-platform reproducibility than highly expressed genes. All platforms showed similar patterns of differential expression between sample groups, with steroid biosynthesis consistently identified as the most down-regulated metabolic pathway by probucol treatment.
This work offers a timely guide for metabolic disease investigators to enable informed end-user decisions regarding choice of expression microarray platform best-suited to specific research project goals. Successful cross-platform integration of biochemical pathway expression data is also demonstrated, especially for well-annotated and highly-expressed genes. However, integration of gene-level expression data is limited by individual platform probe design and the expression level of target genes. Cross-platform analyses of biochemical pathway data will require additional data processing and novel computational bioinformatics tools to address unique statistical challenges.
PMCID: PMC2824080  PMID: 19944634
5.  Utility of sequenced genomes for microsatellite marker development in non-model organisms: a case study of functionally important genes in nine-spined sticklebacks (Pungitius pungitius) 
BMC Genomics  2010;11:334.
Identification of genes involved in adaptation and speciation by targeting specific genes of interest has also become a plausible strategy for non-model organisms. We investigated the potential utility of available sequenced fish genomes to develop microsatellite (i.e. simple sequence repeat, SSR) markers for functionally important genes in nine-spined sticklebacks (Pungitius pungitius), as well as cross-species transferability of SSR primers from three-spined (Gasterosteus aculeatus) to nine-spined sticklebacks. In addition, we examined the patterns and degree of SSR conservation between these species using their aligned sequences.
Cross-species amplification success was lower for SSR markers located in or around functionally important genes (27 out of 158) than for those randomly derived from genomic (35 out of 101) and cDNA (35 out of 87) libraries. Polymorphism was observed at a large proportion (65%) of the cross-amplified loci independently of SSR type. To develop SSR markers for functionally important genes in nine-spined sticklebacks, SSR locations were surveyed in or around 67 target genes based on the three-spined stickleback genome and these regions were sequenced with primers designed from conserved sequences in sequenced fish genomes. Out of the 81 SSRs identified in the sequenced regions (44,084 bp), 57 exhibited the same motifs at the same locations as in the three-spined stickleback. Di- and trinucleotide SSRs appeared to be highly conserved whereas mononucleotide SSRs were less so. Species-specific primers were designed to amplify 58 SSRs using the sequences of nine-spined sticklebacks.
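For readers unfamiliar with how SSRs are located in sequence data, a simple scan for perfect mono-, di- and trinucleotide repeats might look like the sketch below. The minimum repeat thresholds and the function names are assumptions chosen for illustration; the abstract does not state which detection tool or thresholds were used.

```python
import re

# Minimum repeat counts per motif length; illustrative thresholds only.
MIN_REPEATS = {1: 10, 2: 6, 3: 5}

def is_degenerate(motif):
    """True if the motif is itself a repeat of a shorter unit (e.g. 'AA')."""
    k = len(motif)
    return any(k % d == 0 and motif == motif[:d] * (k // d) for d in range(1, k))

def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in min_repeats.items():
        # A captured unit followed by at least (min_rep - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if is_degenerate(motif):
                continue  # e.g. an A-run reported as 'AA' repeats
            hits.append((m.start(), motif, len(m.group(0)) // unit_len))
    return sorted(hits)

example_seq = "GG" + "AC" * 7 + "TT" + "A" * 12 + "CT" + "CAG" * 5
ssr_hits = find_ssrs(example_seq)
print(ssr_hits)
```

Comparing the (motif, position) pairs found in aligned regions of two species is one straightforward way to ask whether an SSR is conserved, in the spirit of the comparison described above.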
Our results demonstrated that a large proportion of SSRs are conserved between species that diverged more than 10 million years ago. Therefore, the three-spined stickleback genome can be used to predict SSR locations in the nine-spined stickleback genome. While cross-species utility of SSR primers is limited due to low amplification success, SSR markers can be developed for target genes and genomic regions using our approach, which should also be applicable to other non-model organisms. The SSR markers developed in this study should be useful for identification of genes responsible for phenotypic variation and adaptive divergence of nine-spined stickleback populations, as well as for constructing comparative gene maps of nine-spined and three-spined sticklebacks.
PMCID: PMC2891615  PMID: 20507571
6.  High-density rhesus macaque oligonucleotide microarray design using early-stage rhesus genome sequence information and human genome annotations 
BMC Genomics  2007;8:28.
Until recently, few genomic reagents specific for non-human primate research have been available. To address this need, we have constructed a macaque-specific high-density oligonucleotide microarray by using highly fragmented low-pass sequence contigs from the rhesus genome project together with the detailed sequence and exon structure of the human genome. Using this method, we designed oligonucleotide probes to over 17,000 distinct rhesus/human gene orthologs and increased by four-fold the number of available genes relative to our first-generation expressed sequence tag (EST)-derived array.
We constructed a database containing 248,000 exon sequences from 23,000 human RefSeq genes and compared each human exon with its best matching sequence in the January 2005 version of the rhesus genome project list of 486,000 DNA contigs. Best matching rhesus exon sequences for each of the 23,000 human genes were then concatenated in the proper order and orientation to produce a rhesus "virtual transcriptome." Microarray probes were designed, one per gene, to the region closest to the 3' untranslated region (UTR) of each rhesus virtual transcript. Each probe was compared to a composite rhesus/human transcript database to test for cross-hybridization potential yielding a final probe set representing 18,296 rhesus/human gene orthologs, including transcript variants, and over 17,000 distinct genes. We hybridized mRNA from rhesus brain and spleen to both the EST- and genome-derived microarrays. Besides four-fold greater gene coverage, the genome-derived array also showed greater mean signal intensities for genes present on both arrays. Genome-derived probes showed 99.4% identity when compared to 4,767 rhesus GenBank sequence tag site (STS) sequences indicating that early stage low-pass versions of complex genomes are of sufficient quality to yield valuable functional genomic information when combined with finished genome information from a closely related species.
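The "virtual transcriptome" construction step, in which best-matching exon sequences are concatenated in the proper order and orientation, can be illustrated with the sketch below. The record layout (exon_index, sequence, strand) is a hypothetical simplification; the abstract does not describe the data structures or alignment output actually used.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Reverse-complement a DNA sequence (N is preserved)."""
    return seq.translate(_COMPLEMENT)[::-1]

def build_virtual_transcript(exon_hits):
    """Concatenate best-matching exon sequences into one transcript.

    exon_hits: list of dicts with hypothetical keys
      'exon_index' (position of the exon in the reference gene),
      'sequence'   (best-matching contig sequence),
      'strand'     ('+' if the match is in transcript orientation, '-' if not).
    """
    parts = []
    for hit in sorted(exon_hits, key=lambda h: h["exon_index"]):
        seq = hit["sequence"]
        if hit["strand"] == "-":
            seq = reverse_complement(seq)  # restore transcript orientation
        parts.append(seq)
    return "".join(parts)

# Toy example: exon 1 was matched on the reverse strand of its contig.
hits = [
    {"exon_index": 2, "sequence": "GGTT", "strand": "+"},
    {"exon_index": 1, "sequence": "CAT", "strand": "-"},
]
virtual_transcript = build_virtual_transcript(hits)
print(virtual_transcript)
```

Probe design toward the 3' end, as described in the abstract, would then operate on the tail of each such concatenated sequence.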
The number of different genes represented on microarrays for unfinished genomes can be greatly increased by matching known gene transcript annotations from a closely related species with sequence data from the unfinished genome. Signal intensity on both EST- and genome-derived arrays was highly correlated with probe distance from the 3' UTR, information often missing from ESTs yet present in early-stage genome projects.
PMCID: PMC1790710  PMID: 17244361
7.  Design, Validation and Annotation of Transcriptome-Wide Oligonucleotide Probes for the Oligochaete Annelid Eisenia fetida 
PLoS ONE  2010;5(12):e14266.
High density oligonucleotide probe arrays have increasingly become an important tool in genomics studies. In organisms with incomplete genome sequence, one strategy for oligo probe design is to reduce the number of unique probes that target every non-redundant transcript through bioinformatic analysis and experimental testing. Here we adopted this strategy in making oligo probes for the earthworm Eisenia fetida, a species for which we have sequenced transcriptome-scale expressed sequence tags (ESTs). Our objectives were to identify unique transcripts as targets, to select an optimal and non-redundant oligo probe for each of these target ESTs, and to annotate the selected target sequences. We developed a streamlined and easy-to-follow approach to the design, validation and annotation of species-specific array probes. Four 244K-formatted oligo arrays were designed using eArray and were hybridized to a pooled E. fetida cRNA sample. We identified 63,541 probes with unsaturated signal intensities consistently above the background level. Target transcripts of these probes were annotated using several sequence alignment algorithms. Significant hits were obtained for 37,439 (59%) probed targets. We validated and made publicly available 63.5K oligo probes so the earthworm research community can use them to pursue ecological, toxicological, and other functional genomics questions. Our approach is efficient, cost-effective and robust because it (1) does not require a major genomics core facility; (2) allows new probes to be easily added and old probes modified or eliminated when new sequence information becomes available; (3) is not bioinformatics-intensive upfront but does provide opportunities for more in-depth annotation of biological functions for target genes; and (4) allows, if desired, EST orthologs to the UniGene clusters of a reference genome to be identified and selected in order to improve the target gene specificity of designed probes.
This approach is particularly applicable to organisms with a wealth of EST sequences but unfinished genome.
PMCID: PMC2999564  PMID: 21170345
8.  Comparison of RNA-Seq and Microarray in Transcriptome Profiling of Activated T Cells 
PLoS ONE  2014;9(1):e78644.
To demonstrate the benefits of RNA-Seq over microarray in transcriptome profiling, both RNA-Seq and microarray analyses were performed on RNA samples from a human T cell activation experiment. In contrast to other reports, our analyses focused on the difference, rather than similarity, between RNA-Seq and microarray technologies in transcriptome profiling. A comparison of data sets derived from RNA-Seq and Affymetrix platforms using the same set of samples showed a high correlation between gene expression profiles generated by the two platforms. However, it also demonstrated that RNA-Seq was superior in detecting low abundance transcripts, differentiating biologically critical isoforms, and allowing the identification of genetic variants. RNA-Seq also demonstrated a broader dynamic range than microarray, which allowed for the detection of more differentially expressed genes with higher fold-change. Analysis of the two datasets also showed the benefit derived from avoidance of technical issues inherent to microarray probe performance such as cross-hybridization, non-specific hybridization and limited detection range of individual probes. Because RNA-Seq does not rely on a pre-designed complement sequence detection probe, it is devoid of issues associated with probe redundancy and annotation, which simplified interpretation of the data. Despite the superior benefits of RNA-Seq, microarrays are still the more common choice of researchers when conducting transcriptional profiling experiments. This is likely because RNA-Seq sequencing technology is new to most researchers, more expensive than microarray, data storage is more challenging and analysis is more complex. We expect that once these barriers are overcome, the RNA-Seq platform will become the predominant tool for transcriptome analysis.
PMCID: PMC3894192  PMID: 24454679
9.  Characterization of a newly developed chicken 44K Agilent microarray 
BMC Genomics  2008;9:60.
The development of microarray technology has greatly enhanced our ability to evaluate gene expression. In theory, the expression of all genes in a given organism can be monitored simultaneously. Sequencing of the chicken genome has provided the crucial information for the design of a comprehensive chicken transcriptome microarray. A long oligonucleotide microarray has been manually curated and designed by our group and manufactured using Agilent inkjet technology. This provides a flexible and powerful platform with high sensitivity and specificity for gene expression studies.
A chicken 60-mer oligonucleotide microarray consisting of 42,034 features, including the entire Marek's disease virus, two avian influenza viruses (H5N2 and H5N3), and 150 chicken microRNAs, has been designed and tested. In an important validation study, total RNA isolated from four major chicken tissues, cecal tonsil (C), ileum (I), liver (L), and spleen (S), was used for comparative hybridizations. More than 95% of spots had a high signal-to-noise ratio (SNR > 10). There were 2886, 2660, 358, 3208, 3355, and 3710 genes differentially expressed between liver and spleen, spleen and cecal tonsil, cecal tonsil and ileum, liver and cecal tonsil, liver and ileum, and spleen and ileum, respectively (P < 10⁻⁷). A number of tissue-selective genes were identified for cecal tonsil, ileum, liver, and spleen (95, 71, 535, and 108, respectively; P < 10⁻⁷). These data also revealed that the antimicrobial peptides GAL1, GAL2, GAL6 and GAL7 were highly expressed in the spleen compared to other tissues tested.
A chicken 60-mer oligonucleotide 44K microarray was designed and validated in a comprehensive survey of gene expression in diverse tissues. The results of these tissue expression analyses have demonstrated that this microarray has high specificity and sensitivity, and will be a useful tool for chicken functional genomics. Novel data on the expression of putative tissue specific genes and antimicrobial peptides is highlighted as part of this comprehensive microarray validation study. The information for accessing and ordering this 44K chicken array can be found at
PMCID: PMC2262898  PMID: 18237426
10.  UPS 2.0: unique probe selector for probe design and oligonucleotide microarrays at the pangenomic/ genomic level 
BMC Genomics  2010;11(Suppl 4):S6.
Nucleic acid hybridization is an extensively adopted principle in biomedical research, in which the performance of any hybridization-based method depends on the specificity of probes to their targets. To determine the optimal probe(s) for detecting target(s) from a sample cocktail, we developed a novel algorithm, which has been implemented in a web platform for probe design. This probe design workflow has now been upgraded to serve experiments that require a probe design tool able to handle the increasing volume of sequence datasets.
Algorithms and probe parameters applied in UPS 2.0 include GC content, secondary structure, melting temperature (Tm), the stability of the probe-target duplex estimated by the thermodynamic model, sequence complexity, similarity of probes to non-target sequences, and other empirical parameters used in the laboratory. Several probe background options (Unique probe within a group, Unique probe in a specific Unigene set, Unique probe based on the pangenomic level, and Unique probe in the user-defined genome/transcriptome) are available to match the scenario in which the experiments will be conducted. Parameters such as salt concentration and the lower-bound Tm of probes are available for users to optimize their probe design query. Output files are available for download on the result page. Probes designed by the UPS algorithm are suitable for generating microarrays, and the performance of UPS-designed probes has been validated by experiments.
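Two of the simpler parameters listed above, GC content and melting temperature, can be computed as in the following sketch. This is not the UPS 2.0 implementation: the Tm estimate here uses the Wallace rule (2 °C per A/T plus 4 °C per G/C), a rough approximation intended for short oligonucleotides that is swapped in for the thermodynamic model the abstract mentions, and the filter thresholds are arbitrary illustrative values.

```python
def gc_content(probe):
    """Fraction of G and C bases in the probe."""
    probe = probe.upper()
    return (probe.count("G") + probe.count("C")) / len(probe)

def wallace_tm(probe):
    """Approximate Tm in degrees C by the Wallace rule: 2*(A+T) + 4*(G+C)."""
    probe = probe.upper()
    at = probe.count("A") + probe.count("T")
    gc = probe.count("G") + probe.count("C")
    return 2 * at + 4 * gc

def passes_filters(probe, gc_range=(0.4, 0.6), min_tm=50):
    """Keep probes whose GC content and approximate Tm fall within bounds.

    gc_range and min_tm are hypothetical defaults, not UPS 2.0 settings.
    """
    low, high = gc_range
    return low <= gc_content(probe) <= high and wallace_tm(probe) >= min_tm

candidates = ["ATGCATGCATGCATGCATGC", "ATATATATATATATATATAT"]
kept = [p for p in candidates if passes_filters(p)]
print(kept)
```

A lower-bound Tm parameter such as min_tm plays the same role as the user-adjustable lower-bound Tm described in the abstract, although a production tool would also account for salt concentration and duplex stability.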
UPS 2.0 evaluates probe-to-target hybridization under a user-defined condition to ensure high-performance hybridization with minimal chance of non-specific binding at the pangenomic and genomic levels. The UPS algorithm mimics the target/non-target mixture in an experiment and is very useful in developing diagnostic kits and microarrays. In the last 30 months, the UPS 2.0 website has had more than 1,300 visits and has run probe design tasks on 360,000 sequences. It is freely accessible at
Screen cast:
PMCID: PMC3005932  PMID: 21143815
11.  Characterization of Common Carp Transcriptome: Sequencing, De Novo Assembly, Annotation and Comparative Genomics 
PLoS ONE  2012;7(4):e35152.
Common carp (Cyprinus carpio) is one of the most important aquaculture species of Cyprinidae, with an annual global production of 3.4 million tons, accounting for nearly 14% of the freshwater aquaculture production in the world. Because of the economic and ecological importance of common carp, genomic data are urgently needed for genetic improvement. However, sufficient transcriptome data are still not available. The objective of the project is to sequence the transcriptome deeply and provide well-assembled transcriptome sequences to the common carp research community.
Transcriptome sequencing of common carp was performed using the Roche 454 platform. A total of 1,418,591 clean ESTs were collected and assembled into 36,811 cDNA contigs, with an average length of 888 bp and an N50 length of 1,002 bp. Annotation was performed and a total of 19,165 unique proteins were identified from the assembled contigs. Gene ontology and KEGG analyses were performed to classify all contigs into functional categories for understanding gene functions and regulation pathways. Open Reading Frames (ORFs) were detected in 29,869 (81.1%) contigs with an average ORF length of 763 bp. From these contigs, 9,625 full-length cDNAs were identified with sequence lengths from 201 bp to 9,956 bp. Comparative analysis revealed that 27,693 (75.2%) contigs have significant similarity to zebrafish RefSeq proteins, and 24,371 (66.2%), 24,501 (66.5%) and 25,025 (70.0%) to tetraodon, medaka and three-spined stickleback RefSeq proteins, respectively. A total of 2,064 microsatellites were initially identified from 1,730 contigs, and 1,639 unique sequences had sufficient flanking sequences on both sides for primer design.
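The N50 statistic quoted above is the contig length at which contigs of that length or longer contain at least half of the total assembled bases. A minimal sketch of the calculation, using made-up contig lengths rather than the carp assembly itself:

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are sorted from longest to shortest and accumulated until the
    running total reaches half of the summed length; the length of the
    contig at which this happens is the N50.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("n50() requires at least one contig length")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length

toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]  # hypothetical contig lengths
toy_n50 = n50(toy_lengths)
print(toy_n50)
```

Because N50 is weighted toward long contigs, it typically exceeds the mean contig length, which is consistent with the reported values of 1,002 bp and 888 bp.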
The transcriptome of common carp has been deeply sequenced, de novo assembled and characterized, providing a valuable resource for better understanding of the common carp genome. The transcriptome data will facilitate future functional studies on the common carp genome and can gradually be applied in breeding programs of common carp, as well as of other closely related Cyprinids.
PMCID: PMC3325976  PMID: 22514716
12.  Development and validation of a gene expression oligo microarray for the gilthead sea bream (Sparus aurata) 
BMC Genomics  2008;9:580.
Aquaculture represents the most sustainable alternative of seafood supply to substitute for the declining marine fisheries, but severe production bottlenecks remain to be solved. The application of genomic technologies offers much promise to rapidly increase our knowledge on biological processes in farmed species and overcome such bottlenecks. Here we present an integrated platform for mRNA expression profiling in the gilthead sea bream (Sparus aurata), a marine teleost of great importance for aquaculture.
A public database was constructed, consisting of 19,734 unique clusters (3,563 contigs and 16,171 singletons). Functional annotation was obtained for 8,021 clusters. Over 4,000 sequences were also associated with a GO entry. Two 60mer probes were designed for each gene and in-situ synthesized on glass slides using Agilent SurePrint™ technology. Platform reproducibility and accuracy were assessed on two early stages of sea bream development (one-day-old and four-day-old larvae). Correlation between technical replicates was always > 0.99, with strong positive correlation between paired probes. A two-class SAM test identified 1,050 differentially expressed genes between the two developmental stages. Functional analysis suggested that down-regulated transcripts (407) in older larvae are mostly essential/housekeeping genes, whereas tissue-specific genes are up-regulated in parallel with the formation of key organs (eye, digestive system). Cross-validation of microarray data was carried out using quantitative RT-PCR (qRT-PCR) on 11 target genes, selected to reflect the whole range of fold-change and both up-regulated and down-regulated genes. A statistically significant positive correlation was obtained comparing expression levels for each target gene across all biological replicates. Good concordance between qRT-PCR and microarray data was observed between 2- and 7-fold change, while fold-change compression in the microarray was present for differences greater than 10-fold in the qRT-PCR.
A highly reliable oligo-microarray platform was developed and validated for the gilthead sea bream despite the presently limited knowledge of the species' transcriptome. Because of its flexible design, this array will be able to accommodate additional probes as soon as novel unique transcripts are available.
PMCID: PMC2648989  PMID: 19055773
13.  Identification of SNPs and INDELS in swine transcribed sequences using short oligonucleotide microarrays 
BMC Genomics  2008;9:252.
Genome-wide detection of single feature polymorphisms (SFP) in swine using transcriptome profiling of day 25 placental RNA by contrasting probe intensities from either Meishan or an occidental composite breed with Affymetrix porcine microarrays is presented. A linear mixed model analysis was used to identify significant breed-by-probe interactions.
Gene-specific linear mixed models were fit to each of the log2 transformed probe intensities on these arrays, using fixed effects for breed, probe, breed-by-probe interaction, and a random effect for array. After surveying the day 25 placental transcriptome, 857 probes with a q-value ≤ 0.05 and |fold change| ≥ 2 for the breed-by-probe interaction were identified as candidates containing SFP. To address the quality of the bioinformatics approach, universal pyrosequencing assays were designed from Affymetrix exemplar sequences to independently assess polymorphisms within a subset of probes for validation. Additionally, probes were randomly selected for sequencing to determine an unbiased confirmation rate. In most cases, the 25-mer probe sequence printed on the microarray diverged from Meishan, not occidental crosses. This analysis was used to define a set of highly reliable predicted SFPs according to their probability scores.
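The candidate selection rule stated above (q-value ≤ 0.05 and |fold change| ≥ 2 for the breed-by-probe interaction) can be expressed as a simple filter. In this sketch the input records, their field names, and the signed fold change convention applied to a log2 interaction estimate are assumptions made for illustration; fitting the mixed models themselves is outside its scope.

```python
def signed_fold_change(log2_diff):
    """Convert a log2 difference to a signed fold change.

    Assumed convention: +2**d for d >= 0 and -2**(-d) for d < 0, so a
    halving is reported as -2 rather than 0.5.
    """
    return 2.0 ** log2_diff if log2_diff >= 0 else -(2.0 ** -log2_diff)

def sfp_candidates(rows, q_max=0.05, min_abs_fc=2.0):
    """Return probe IDs passing both the q-value and fold change cut-offs."""
    return [
        row["probe"]
        for row in rows
        if row["q_value"] <= q_max
        and abs(signed_fold_change(row["log2_diff"])) >= min_abs_fc
    ]

# Hypothetical interaction estimates for four probes.
rows = [
    {"probe": "p1", "q_value": 0.01, "log2_diff": 1.5},   # passes both
    {"probe": "p2", "q_value": 0.20, "log2_diff": 3.0},   # fails q-value
    {"probe": "p3", "q_value": 0.04, "log2_diff": -1.0},  # exactly 2-fold down
    {"probe": "p4", "q_value": 0.03, "log2_diff": 0.5},   # fold change too small
]
candidates_sfp = sfp_candidates(rows)
print(candidates_sfp)
```

Requiring both criteria keeps probes whose breed difference is statistically supported and large enough to be plausibly caused by a sequence mismatch under the probe.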
By applying an SFP detection method to two mammalian breeds for the first time, we detected transition and transversion single nucleotide polymorphisms, as well as insertions/deletions, which can be used to rapidly develop markers for genetic mapping and association analysis in species where high density genotyping platforms are otherwise unavailable.
SNPs and INDELS discovered by this approach have been publicly deposited in NCBI's SNP repository dbSNP. This method is an attractive bioinformatics tool for uncovering breed-by-probe interactions, for rapidly identifying expressed SNPs, for investigating potential functional correlations between gene expression and breed polymorphisms, and is robust enough to be used on any Affymetrix gene expression platform.
PMCID: PMC2442091  PMID: 18510738
14.  Strand-specific transcriptome profiling with directly labeled RNA on genomic tiling microarrays 
With lower manufacturing cost, high spot density, and flexible probe design, genomic tiling microarrays are ideal for comprehensive transcriptome studies. Typically, transcriptome profiling using microarrays involves reverse transcription, which converts RNA to cDNA. The cDNA is then labeled and hybridized to the probes on the arrays, thus the RNA signals are detected indirectly. Reverse transcription is known to generate artifactual cDNA, in particular the synthesis of second-strand cDNA, leading to false discovery of antisense RNA. To address this issue, we have developed an effective method using RNA that is directly labeled, thus by-passing the cDNA generation. This paper describes this method and its application to the mapping of transcriptome profiles.
RNA extracted from laboratory cultures of Porphyromonas gingivalis was fluorescently labeled with an alkylation reagent and hybridized directly to probes on genomic tiling microarrays specifically designed for this periodontal pathogen. The generated transcriptome profile was strand-specific and produced signals close to background level in most antisense regions of the genome. In contrast, high levels of signal were detected in the antisense regions when the hybridization was done with cDNA. Five antisense areas were tested with independent strand-specific RT-PCR and no or only negligible amplification was detected, indicating that the strong antisense cDNA signals were experimental artifacts.
An efficient method was developed for mapping transcriptome profiles specific to both coding strands of a bacterial genome. This method chemically labels extracted RNA and uses it directly in microarray hybridization. The generated transcriptome profile was free of cDNA artifactual signals. In addition, this method requires fewer processing steps and is potentially more sensitive in detecting small amounts of RNA compared to conventional end-labeling methods, due to the incorporation of more fluorescent molecules per RNA fragment.
PMCID: PMC3031212  PMID: 21235785
15.  Transcription and redox enzyme activities: comparison of equilibrium and disequilibrium levels in the three-spined stickleback 
Evolutionary and acclimatory responses require functional variability, but in contrast with mRNA and protein abundance data, most physiological measurements cannot be obtained in a high-throughput manner. Consequently, one must either rely on high-throughput transcriptomic or proteomic data with only predicted functional information, or accept the limitation that most physiological measurements can give fewer data than those provided by transcriptomics or proteomics. We evaluated how transcriptional and redox enzyme activity data agreed with regard to population differentiation (i.e. a system in steady state in which any time lag between transcription, translation and post-translational effects would be irrelevant) and in response to an acute 6°C increase in temperature (i.e. a disequilibrium state wherein translation could not have caught up with transcription) in the three-spined stickleback (Gasterosteus aculeatus). Transcriptional and enzyme activity data corresponded well with regard to population differentiation, but less so with regard to acute temperature increase. The data thus suggest that transcriptional and functional measurements can lead to similar conclusions when a biological system is in a steady state. The responses to acute changes must, as has been demonstrated earlier, be based on changes in cellular conditions or properties of existing proteins without significant de novo synthesis of new gene products.
PMCID: PMC3574399  PMID: 23363636
mRNA–protein correlation; temperature; population differentiation
16.  Transcriptome assembly and microarray construction for Enchytraeus crypticus, a model oligochaete to assess stress response mechanisms derived from soil conditions 
BMC Genomics  2014;15:302.
The soil worm Enchytraeus crypticus (Oligochaeta) is an ecotoxicology model species that, until now, was without genome or transcriptome sequence information. The present research aims at studying the transcriptome of Enchytraeus crypticus, sampled from multiple test conditions, and the construction of a high-density microarray for functional genomic studies.
Over 1.5 million cDNA sequence reads were obtained, representing 645 million nucleotides. After assembly, 27,296 contigs and 87,686 singletons were obtained, of which 44% and 25%, respectively, were annotated as protein-coding genes sharing homology with other animal proteomes. Concerning assembly quality, 84% of the contig sequences contain an open reading frame with a start codon, while E. crypticus homologs were identified for 92% of the core eukaryotic genes. Moreover, 65% and 77% of the singletons and contigs without known homologs, respectively, were shown to be transcribed in an independent microarray experiment. An Agilent 180 K microarray platform was designed and validated by hybridizing cDNA from E. crypticus exposed for 4 days to zinc at the concentration corresponding to a 50% reduction in reproduction after three weeks (EC50). Overall, 70% of all probes showed expression above background levels (mean signal + 1x standard deviation). More specifically, the probes derived from contigs showed a wider range of average intensities than probes derived from singletons. In total, 522 significantly differentially regulated transcripts were identified upon zinc exposure. Several significantly regulated genes had predicted functions (e.g. zinc efflux, zinc transport) associated with zinc stress. Unexpectedly, the microarray data suggest that zinc exposure alters retrotransposon activity in the E. crypticus genome.
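The detection criterion mentioned above (signal above the background mean plus one standard deviation) can be illustrated with a minimal sketch. It assumes that background is estimated from a separate set of background probe intensities; the function name, the `k` parameter and all intensity values are hypothetical and are not taken from the study.

```python
from statistics import mean, stdev

def fraction_above_background(signals, background, k=1.0):
    """Fraction of probe signals exceeding mean(background) + k * sd(background)."""
    threshold = mean(background) + k * stdev(background)
    detected = [s for s in signals if s > threshold]
    return len(detected) / len(signals)

# Hypothetical intensities, for illustration only
background = [10.0, 12.0, 11.0, 9.0, 13.0]
signals = [8.0, 15.0, 40.0, 12.0, 100.0]
print(fraction_above_background(signals, background))
```

Raising `k` makes the detection call stricter; the abstract reports the result for one standard deviation only.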
An initial investigation of the E. crypticus transcriptome including an associated microarray platform for future studies proves to be a valuable resource to investigate functional genomics mechanisms of toxicity in soil environments and to annotate a potentially large number of lineage specific genes that are responsive to environmental stress conditions.
PMCID: PMC4234436  PMID: 24758194
Ecotoxicogenomics; Next-generation pyrosequencing; Invertebrate; Zinc; Annelid; 454 sequencing
17.  New Agilent platform DNA microarrays for transcriptome analysis of Plasmodium falciparum and Plasmodium berghei for the malaria research community 
Malaria Journal  2012;11:187.
DNA microarrays have been a valuable tool in malaria research for over a decade but remain in limited use, in part due to their relatively high cost, poor availability, and technical difficulty. With the aim of alleviating some of these factors, next-generation DNA microarrays for genome-wide transcriptome analysis for both Plasmodium falciparum and Plasmodium berghei using the Agilent 8x15K platform were designed.
Probe design was adapted from previously published methods and based on the most current transcript predictions available at the time for P. falciparum or P. berghei. Array performance and transcriptome analysis were determined using dye-coupled, aminoallyl-labelled cDNA, and streamlined methods for hybridization, washing, and array analysis were developed.
The new array design marks a notable improvement in the number of transcripts covered and average number of probes per transcript. Array performance was excellent across a wide range of transcript abundance, with low inter-array and inter-probe variability for relative abundance measurements and it recapitulated previously observed transcriptional patterns. Additionally, improvements in sensitivity permitted a 20-fold reduction in necessary starting RNA amounts, further reducing experimental costs and widening the range of application.
DNA microarrays utilizing the Agilent 8x15K platform for genome-wide transcript analysis in P. falciparum and P. berghei mark an improvement in coverage and sensitivity, increased availability to the research community, and simplification of the experimental methods.
PMCID: PMC3411454  PMID: 22681930
18.  Comparison of Affymetrix Gene Array with the Exon Array shows potential application for detection of transcript isoform variation 
BMC Genomics  2009;10:519.
The emergence of isoform-sensitive microarrays has helped fuel in-depth studies of the human transcriptome. The Affymetrix GeneChip Human Exon 1.0 ST Array (Exon Array) has been previously shown to be effective in profiling gene expression at the isoform level. More recently, the Affymetrix GeneChip Human Gene 1.0 ST Array (Gene Array) has been released for measuring gene expression and interestingly contains a large subset of probes from the Exon Array. Here, we explore the potential of using Gene Array probes to assess expression variation at the sub-transcript level. Utilizing datasets of the high quality Microarray Quality Control (MAQC) RNA samples previously assayed on the Exon Array and Gene Array, we compare the expression measurements of the two platforms to determine the performance of the Gene Array in detecting isoform variations.
Overall, we show that the Gene Array is comparable to the Exon Array in making gene expression calls. Moreover, to examine expression of different isoforms, we modify the Gene Array probe set definition file to enable summarization of probe intensity values at the exon level and show that the expression profiles between the two platforms are also highly correlated. Next, expression calls of previously known differentially spliced genes were compared and also show concordant results. Splicing index analysis, representing estimates of exon inclusion levels, shows a lower but good correlation between platforms. As the Gene Array contains a significant subset of probes from the Exon Array, we note that, in comparison, the Gene Array overlaps with fewer but still a high proportion of splicing events annotated in the Known Alt Events UCSC track, with abundant coverage of cassette exons. We discuss the ability of the Gene Array to detect alternative splicing and isoform variation and address its limitations.
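As a rough illustration of the splicing index mentioned above, the sketch below uses the commonly cited definition: the exon-level intensity is normalized by the gene-level intensity in each sample, and the log2 ratio of the two normalized values is taken. The function name and the intensity values are hypothetical; the exact summarization used in the study is not given in the abstract.

```python
import math

def splicing_index(exon_a, gene_a, exon_b, gene_b):
    """log2 ratio of gene-normalized exon intensities between samples A and B."""
    ni_a = exon_a / gene_a  # normalized exon intensity in sample A
    ni_b = exon_b / gene_b  # normalized exon intensity in sample B
    return math.log2(ni_a / ni_b)

# Hypothetical intensities: same gene-level signal, exon signal halved in sample A
si = splicing_index(exon_a=200.0, gene_a=800.0, exon_b=400.0, gene_b=800.0)
print(si)  # -1.0
```

A value near zero suggests similar exon inclusion in both samples, while a negative value suggests lower inclusion in sample A relative to sample B.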
The Gene Array is an effective expression profiling tool at gene and exon expression level, the latter made possible by probe set annotation modifications. We demonstrate that the Gene Array is capable of detecting alternative splicing and isoform variation. As expected, in comparison to the Exon Array, it is limited by reduced gene content coverage and is not able to detect as wide a range of alternative splicing events. However, for the events that can be monitored by both platforms, we estimate that the selectivity and sensitivity levels are comparable. We hope our findings will shed light on the potential extension of the Gene Array to detect alternative splicing. It should be particularly suitable for researchers primarily interested in gene expression analysis, but who may be willing to look for splicing and isoform differences within their dataset. However, we do not suggest it to be an equivalent substitute to the more comprehensive Exon Array.
PMCID: PMC2780461  PMID: 19909511
19.  In vitro homology search array comprehensively reveals highly conserved genes and their functional characteristics in non-sequenced species 
BMC Genomics  2010;11(Suppl 4):S9.
With the increase in genomic and transcriptomic data produced by the recent advancements in next generation sequencers and microarrays, it is now easier than ever to conduct large-scale comparative genomic studies for familiar species. However, there are more than ten million species on earth, and the study of all remaining species is not realistic in terms of cost and time. There have been a number of attempts at using microarrays for cross-species hybridization; however, those approaches only utilized the same probes for each species or different probes designed from orthologous genes. To establish easier and cheaper methods for the large-scale comparative genomic study of non-sequenced species, we developed an in vitro homology search array with the aid of a bioinformatic approach to probe design.
To perform large-scale genomic comparisons of non-sequenced species, we chose squid, one of the most intelligent species among Protostomes, for comparison with human genes. We designed a microarray using human single copy genes and conducted microarray experiments with mRNAs extracted from the squid. Multi-copy genes could not be detected using the microarray in this study because their sequence similarity caused cross-hybridization. A search for squid homologous genes among human genes revealed that 68% of the human probes tested showed the expression of squid homolog genes and 95 genes were confirmed to be expressed highly in squid. Functional classification analysis showed that these highly expressed genes comprise DNA binding proteins, which are under pressure of DNA level mutation and, consequently, show high similarity at the nucleotide level.
Our array could detect homologous genes in squids and humans in spite of the distant phylogenetic relationships between the species. This experimental method will be useful for identifying homologs in non-sequenced species, for the development of genetic resources and for the collection of information on biodiversity, particularly when using the genome of sibling or closely related species.
PMCID: PMC3005928  PMID: 21143818
20.  AnyExpress: Integrated toolkit for analysis of cross-platform gene expression data using a fast interval matching algorithm 
BMC Bioinformatics  2011;12:75.
Cross-platform analysis of gene expression data requires multiple, intricate processes at different layers with various platforms. However, existing tools handle only a single platform and are not flexible enough to support custom changes, which arise from new statistical methods, updated versions of reference data, and better platforms released every month or year. Current tools are so tightly coupled with reference information, such as the reference genome, transcriptome database, and SNPs, which is often erroneous or outdated, that the output results are incorrect and misleading.
We developed AnyExpress, a software package that combines cross-platform gene expression data using a fast interval-matching algorithm. Supported platforms include next-generation-sequencing technology, microarray, SAGE, MPSS, and more. Users can define custom target transcriptome database references for probe/read mapping in any species, as well as criteria to remove undesirable probes/reads.
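The abstract does not describe the interval-matching algorithm itself, so the following is only a generic sketch of the idea of mapping probe or read intervals to target intervals by binary search. It assumes non-overlapping, half-open target intervals on a single sequence; the function name, the interval names and the coordinates are hypothetical and do not represent the AnyExpress implementation.

```python
from bisect import bisect_right

def match_intervals(targets, queries):
    """Map each query (start, end) to the name of the target that fully contains it.

    targets: list of (start, end, name), non-overlapping, half-open [start, end).
    Returns a list of names, with None where no target contains the query.
    """
    targets = sorted(targets)
    starts = [t[0] for t in targets]
    matches = []
    for q_start, q_end in queries:
        # Index of the last target that starts at or before the query start
        i = bisect_right(starts, q_start) - 1
        if i >= 0 and q_end <= targets[i][1]:
            matches.append(targets[i][2])
        else:
            matches.append(None)
    return matches

# Hypothetical exon targets and probe/read intervals
targets = [(100, 200, "exon1"), (300, 450, "exon2")]
queries = [(120, 145), (190, 215), (300, 325), (50, 75)]
print(match_intervals(targets, queries))  # ['exon1', None, 'exon2', None]
```

Sorting the targets once and using binary search keeps each lookup logarithmic in the number of targets, which is one plausible way to make such matching fast.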
AnyExpress offers scalable processing features such as binding, normalization, and summarization that are not present in existing software tools.
As a case study, we applied AnyExpress to published Affymetrix microarray and Illumina NGS RNA-Seq data from human kidney and liver. The mean within-platform correlation coefficient was 0.98 for samples from kidney and liver. The mean cross-platform correlation coefficient was 0.73. These results confirmed those of the original and secondary studies. Applying filtering produced higher agreement between microarray and NGS, according to an agreement index calculated from differentially expressed genes.
AnyExpress can combine cross-platform gene expression data, process data from both open- and closed-platforms, select a custom target reference, filter out undesirable probes or reads based on custom-defined biological features, and perform quantile-normalization with a large number of microarray samples. AnyExpress is fast, comprehensive, flexible, and freely available at
PMCID: PMC3076267  PMID: 21410990
21.  Population Genomics of Parallel Adaptation in Threespine Stickleback using Sequenced RAD Tags 
PLoS Genetics  2010;6(2):e1000862.
Next-generation sequencing technology provides novel opportunities for gathering genome-scale sequence data in natural populations, laying the empirical foundation for the evolving field of population genomics. Here we conducted a genome scan of nucleotide diversity and differentiation in natural populations of threespine stickleback (Gasterosteus aculeatus). We used Illumina-sequenced RAD tags to identify and type over 45,000 single nucleotide polymorphisms (SNPs) in each of 100 individuals from two oceanic and three freshwater populations. Overall estimates of genetic diversity and differentiation among populations confirm the biogeographic hypothesis that large panmictic oceanic populations have repeatedly given rise to phenotypically divergent freshwater populations. Genomic regions exhibiting signatures of both balancing and divergent selection were remarkably consistent across multiple, independently derived populations, indicating that replicate parallel phenotypic evolution in stickleback may be occurring through extensive, parallel genetic evolution at a genome-wide scale. Some of these genomic regions co-localize with previously identified QTL for stickleback phenotypic variation identified using laboratory mapping crosses. In addition, we have identified several novel regions showing parallel differentiation across independent populations. Annotation of these regions revealed numerous genes that are candidates for stickleback phenotypic evolution and will form the basis of future genetic analyses in this and other organisms. This study represents the first high-density SNP–based genome scan of genetic diversity and differentiation for populations of threespine stickleback in the wild. 
These data illustrate the complementary nature of laboratory crosses and population genomic scans by confirming the adaptive significance of previously identified genomic regions, elucidating the particular evolutionary and demographic history of such regions in natural populations, and identifying new genomic regions and candidate genes of evolutionary significance.
Author Summary
Oceanic threespine stickleback have invaded and adapted to freshwater habitats countless times across the northern hemisphere. These freshwater populations have often evolved in similar ways from the ancestral marine stock from which they independently derived. With the exception of a few identified genes, the genetic basis of this remarkable parallel adaptation is unclear. Here we show that the parallel phenotypic evolution is matched by parallel patterns of nucleotide diversity and population differentiation across the genome. We used a novel high-throughput sequence-based genotyping approach to produce the first high density genome-wide scans of threespine stickleback populations and identified several genomic regions indicative of both divergent and balancing selection. Some of these regions have been associated previously with traits important for freshwater adaptation, but others were previously unidentified. Within these genomic regions we identified candidate genes, laying the foundation for further genetic and functional study of key pathways. This research illustrates the complementary nature of laboratory mapping, functional genetics, and population genomics.
PMCID: PMC2829049  PMID: 20195501
22.  Large scale real-time PCR validation on gene expression measurements from two commercial long-oligonucleotide microarrays 
BMC Genomics  2006;7:59.
DNA microarrays are rapidly becoming a fundamental tool in discovery-based genomic and biomedical research. However, the reliability of the microarray results is being challenged due to the existence of different technologies and non-standard methods of data analysis and interpretation. In the absence of a "gold standard"/"reference method" for the gene expression measurements, studies evaluating and comparing the performance of various microarray platforms have often yielded subjective and conflicting conclusions. To address this issue we have conducted a large scale TaqMan® Gene Expression Assay based real-time PCR experiment and used this data set as the reference to evaluate the performance of two representative commercial microarray platforms.
In this study, we analyzed the gene expression profiles of three human tissues: brain, lung, liver and one universal human reference sample (UHR) using two representative commercial long-oligonucleotide microarray platforms: (1) Applied Biosystems Human Genome Survey Microarrays (based on single-color detection); (2) Agilent Whole Human Genome Oligo Microarrays (based on two-color detection). 1,375 genes represented by both microarray platforms and spanning a wide dynamic range in gene expression levels, were selected for TaqMan® Gene Expression Assay based real-time PCR validation. For each platform, four technical replicates were performed on the same total RNA samples according to each manufacturer's standard protocols. For Agilent arrays, comparative hybridization was performed using incorporation of Cy5 for brain/lung/liver RNA and Cy3 for UHR RNA (common reference). Using the TaqMan® Gene Expression Assay based real-time PCR data set as the reference set, the performance of the two microarray platforms was evaluated focusing on the following criteria: (1) Sensitivity and accuracy in detection of expression; (2) Fold change correlation with real-time PCR data in pair-wise tissues as well as in gene expression profiles determined across all tissues; (3) Sensitivity and accuracy in detection of differential expression.
Our study provides one of the largest "reference" data sets of gene expression measurements using TaqMan® Gene Expression Assay based real-time PCR technology. This data set allowed us to use an alternative gene expression technology to evaluate the performance of different microarray platforms. We conclude that microarrays are indeed invaluable discovery tools with acceptable reliability for genome-wide gene expression screening, though validation of putative changes in gene expression remains advisable. Our study also characterizes the limitations of microarrays; understanding these limitations will enable researchers to evaluate microarray results more cautiously and appropriately.
PMCID: PMC1435885  PMID: 16551369
23.  Dynamic probe selection for studying microbial transcriptome with high-density genomic tiling microarrays 
BMC Bioinformatics  2010;11:82.
Current commercial high-density oligonucleotide microarrays can hold millions of probe spots on a single microscopic glass slide and are ideal for studying the transcriptome of microbial genomes using a tiling probe design. This paper describes a comprehensive computational pipeline implemented specifically for designing tiling probe sets to study microbial transcriptome profiles.
The pipeline identifies every possible probe sequence from both forward and reverse-complement strands of all DNA sequences in the target genome including circular or linear chromosomes and plasmids. Final probe sequence lengths are adjusted based on the maximal oligonucleotide synthesis cycles and best isothermality allowed. Optimal probes are then selected in two stages - sequential and gap-filling. In the sequential stage, probes are selected from sequence windows tiled alongside the genome. In the gap-filling stage, additional probes are selected from the largest gaps between adjacent probes that have already been selected, until a predefined number of probes is reached. Selection of the highest quality probe within each window and gap is based on five criteria: sequence uniqueness, probe self-annealing, melting temperature, oligonucleotide length, and probe position.
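The two selection stages described above can be sketched as follows. This is a simplified illustration, not the published pipeline: the five selection criteria are collapsed into a single hypothetical quality score per candidate, the genome is treated as linear, and the function and parameter names are assumptions.

```python
def select_probes(candidates, genome_length, window, n_probes):
    """Select probe positions in two stages.

    candidates: list of (position, score); a higher score means a better probe
    (hypothetical composite of the quality criteria).
    Returns the sorted list of selected positions.
    """
    # Stage 1 (sequential): best-scoring candidate in each consecutive window
    selected = set()
    for w_start in range(0, genome_length, window):
        in_window = [c for c in candidates if w_start <= c[0] < w_start + window]
        if in_window and len(selected) < n_probes:
            selected.add(max(in_window, key=lambda c: c[1])[0])

    # Stage 2 (gap filling): fill the largest gap that still contains a candidate
    while len(selected) < n_probes:
        ordered = sorted(selected)
        gaps = sorted(zip(ordered, ordered[1:]),
                      key=lambda g: g[1] - g[0], reverse=True)
        for left, right in gaps:
            inside = [c for c in candidates if left < c[0] < right]
            if inside:
                selected.add(max(inside, key=lambda c: c[1])[0])
                break
        else:
            break  # no unselected candidate remains inside any gap
    return sorted(selected)

# Hypothetical candidates on a 100-base sequence
candidates = [(5, 0.9), (20, 0.5), (40, 0.7), (60, 0.8), (90, 0.6)]
print(select_probes(candidates, genome_length=100, window=50, n_probes=4))
# [5, 20, 40, 60]
```

In this toy run the sequential stage picks positions 5 and 60, and the gap-filling stage then adds 40 and 20, stopping once the requested number of probes is reached.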
The probe selection pipeline evaluates global and local probe sequence properties and selects a set of probes dynamically and evenly distributed along the target genome. Unlike other similar methods, it can design an exact number of non-redundant probes to utilize all the available probe spots on any chosen microarray platform. The pipeline can be applied to microbial genomes when designing high-density tiling arrays for comparative genomics, ChIP-chip, gene expression and comprehensive transcriptome studies.
PMCID: PMC2836303  PMID: 20144223
24.  Brain Transcriptomic Response of Threespine Sticklebacks to Cues of a Predator 
Brain, Behavior and Evolution  2011;77(4):270-285.
Predation pressure represents a strong selective force that influences the development and evolution of living organisms. An increasing number of studies have shown that both environmental and social factors, including exposure to predators, substantially shape the structure and function of the brain. However, our knowledge about the molecular mechanisms underlying the response of the brain to environmental stimuli is limited. In this study, we used whole-genome comparative oligonucleotide microarrays to investigate the brain transcriptomic response to cues of a predator in the threespine stickleback, Gasterosteus aculeatus. We found that repeated exposure to olfactory, visual and tactile cues of a predator (rainbow trout, Oncorhynchus mykiss) for 6 days resulted in subtle but significant transcriptomic changes in the brain of sticklebacks. Gene functional analysis and gene ontology enrichment revealed that the majority of the transcripts differentially expressed between the fish exposed to cues of a predator and the control group were related to antigen processing and presentation involving the major histocompatibility complex, transmission of synaptic signals, brain metabolic processes, gene regulation and visual perception. The top four identified pathways were synaptic long-term depression, RAN signaling, relaxin signaling and phototransduction. Our study demonstrates that exposure of sticklebacks to cues of a predator results in the activation of a wide range of biological and molecular processes and lays the foundation for future investigations on the molecular factors that modulate the function and evolution of the brain in response to stressors.
PMCID: PMC3182040  PMID: 21677424
Neurogenomics; Stress; Predation; Microarray; Gene expression; Gasterosteus aculeatus
25.  Estimating RNA-quality using GeneChip microarrays 
BMC Genomics  2012;13:186.
Microarrays are a powerful tool for transcriptome analysis. Best results are obtained using high-quality RNA samples for preparation and hybridization. Issues with RNA integrity can lead to low data quality and failure of the microarray experiment.
Microarray intensity data contain information that can be used to estimate the RNA quality of the sample. We here study the interplay of the characteristics of RNA surface hybridization with the effects of partly truncated transcripts on probe intensity. The 3′/5′ intensity gradient, the basis of microarray RNA quality measures, is shown to depend on the degree of competitive binding of specific and of non-specific targets to a particular probe, on the degree of saturation of the probes with bound transcripts and on the distance of the probe from the 3′-end of the transcript. Increasing degrees of non-specific hybridization or of saturation reduce the 3′/5′ intensity gradient and, if not taken into account, this leads to biased results in common quality measures for GeneChip arrays such as affyslope or the control probe intensity ratio. We also found that short probe sets near the 3′-end of the transcripts are prone to non-specific hybridization, presumably because of inaccurate positional assignment and the existence of transcript isoforms with variable 3′ UTRs. Poor RNA quality is associated with a decreased amount of RNA material hybridized on the array, paralleled by a decreased total signal level. Additionally, it causes a gene-specific loss of signal due to the positional bias of transcript abundance, which requires an individual, gene-specific correction. We propose a new RNA quality measure that considers the hybridization mode. Graphical characteristics are introduced allowing assessment of RNA quality of each single array (‘tongs plot’ and ‘degradation hook’). Furthermore, we suggest a method to correct for effects of RNA degradation on microarray intensities.
The presented RNA degradation measure correlates best with the independent RNA integrity measure RIN, and therefore presents itself as a valuable tool for quality control and even for the study of RNA degradation. When RNA degradation effects are detected in microarray experiments, a correction of the induced bias in probe intensities is advised.
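The idea of a 3′/5′ intensity gradient can be illustrated with a minimal sketch that fits a least-squares slope to log2 probe intensities ordered from the 5′ to the 3′ end of a transcript. This is a generic illustration only: it is not the affyslope implementation and not the new measure proposed in the paper, and the function name and intensity values are hypothetical.

```python
import math

def intensity_slope(intensities):
    """Least-squares slope of log2 intensity versus probe index (5' to 3' order)."""
    y = [math.log2(v) for v in intensities]
    n = len(y)
    x = list(range(n))
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    return sxy / sxx

# Hypothetical probe sets: signal rising toward the 3' end versus a flat profile
degraded = [64.0, 128.0, 256.0, 512.0]
intact = [256.0, 256.0, 256.0, 256.0]
print(intensity_slope(degraded), intensity_slope(intact))  # 1.0 0.0
```

A steeper positive slope would indicate stronger loss of signal toward the 5′ end; as the text notes, non-specific hybridization and probe saturation flatten this gradient, so a raw slope of this kind can underestimate degradation.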
PMCID: PMC3519671  PMID: 22583818

