1.  Performance comparison of four exome capture systems for deep sequencing 
BMC Genomics  2014;15(1):449.
Recent developments in deep (next-generation) sequencing technologies are significantly impacting medical research. The global analysis of protein coding regions in genomes of interest by whole exome sequencing is a widely used application. Many technologies for exome capture are commercially available; here we compare the performance of four of them: NimbleGen’s SeqCap EZ v3.0, Agilent’s SureSelect v4.0, Illumina’s TruSeq Exome, and Illumina’s Nextera Exome, all applied to the same human tumor DNA sample.
Each capture technology was evaluated for its coverage of different exome databases, target coverage efficiency, GC bias, sensitivity in single nucleotide variant detection, sensitivity in small indel detection, and technical reproducibility. In general, all technologies performed well; however, our data demonstrated small, but consistent differences between the four capture technologies. Illumina technologies cover more bases in coding and untranslated regions. Furthermore, whereas most of the technologies provide reduced coverage in regions with low or high GC content, the Nextera technology tends to bias towards target regions with high GC content.
We show key differences in performance between the four technologies. Our data should help researchers who are planning exome sequencing to select appropriate exome capture technology for their particular application.
PMCID: PMC4092227  PMID: 24912484
Exome capture technology; Next-generation sequencing; Coverage efficiency; Enrichment efficiency; GC bias; Single nucleotide variant; Indel
2.  The differential disease regulome 
BMC Genomics  2011;12:353.
Transcription factors in disease-relevant pathways represent potential drug targets, by impacting a distinct set of pathways that may be modulated through gene regulation. The influence of transcription factors is typically studied on a per disease basis, and no current resources provide a global overview of the relations between transcription factors and disease. Furthermore, existing pipelines for related large-scale analysis are tailored for particular sources of input data, and there is a need for generic methodology for integrating complementary sources of genomic information.
We here present a large-scale analysis of multiple diseases versus multiple transcription factors, with a global map of over- and under-representation of 446 transcription factors in 1010 diseases. This map, referred to as the differential disease regulome, provides a first global statistical overview of the complex interrelationships between diseases, genes and controlling elements. The map is visualized using the Google map engine, due to its very large size, and provides a range of detailed information in a dynamic presentation format.
The analysis is achieved through a novel methodology that performs a pairwise, genome-wide comparison on the cartesian product of two distinct sets of annotation tracks, e.g. all combinations of one disease and one TF.
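The pairwise comparison over a Cartesian product can be illustrated with a minimal sketch. The abstract does not specify the statistic used, so the observed/expected overlap ratio below is only a stand-in, and the track names, the set-of-bins representation and the function names are all hypothetical.

```python
from itertools import product


def enrichment(track_a, track_b, genome_size):
    """Observed/expected overlap of two tracks given as sets of bin indices.

    This ratio is an illustrative stand-in, not the statistic of the paper.
    Values above 1 suggest over-representation, values below 1 the opposite.
    """
    observed = len(track_a & track_b)
    expected = len(track_a) * len(track_b) / genome_size
    return observed / expected if expected else float("nan")


def pairwise_map(diseases, tfs, genome_size):
    """Score every (disease, TF) pair in the Cartesian product of two track sets."""
    return {
        (d, t): enrichment(diseases[d], tfs[t], genome_size)
        for d, t in product(diseases, tfs)
    }


# Toy data: hypothetical tracks over a genome divided into 100 bins.
diseases = {"disease_A": {1, 2, 3, 4}, "disease_B": {50, 51}}
tfs = {"TF_X": {2, 3, 4, 5, 6}, "TF_Y": {90, 91}}
scores = pairwise_map(diseases, tfs, genome_size=100)
```

With two diseases and two TFs the map holds four cells; the full regulome described above would hold 446 x 1010 of them, one per pair.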
The methodology was also used to generate additional maps from alternative data sets related to transcription and disease, as well as from data sets related to Gene Ontology classification and histone modifications. We provide a web-based interface that allows users to generate other custom maps, which could be based on precisely specified subsets of transcription factors and diseases, or, in general, on any categorical genome annotation tracks as they are improved or become available.
We have created a first resource that provides a global overview of the complex relations between transcription factors and disease. As the accuracy of the disease regulome depends mainly on the quality of the input data, forthcoming ChIP-seq based binding data for many TFs will provide improved maps. We further believe our approach to genome analysis could allow an advance from the current typical situation of one-time integrative efforts to reproducible and upgradable integrative analysis. The differential disease regulome and its associated methodology are available at
PMCID: PMC3160420  PMID: 21736759
3.  Large-scale inference of the point mutational spectrum in human segmental duplications 
BMC Genomics  2009;10:43.
Recent segmental duplications are relatively large (≥ 1 kb) genomic regions of high sequence identity (≥ 90%). They cover approximately 4–5% of the human genome and play important roles in gene evolution and genomic disease. The DNA sequence differences between copies of a segmental duplication represent the result of various mutational events over time, since any two duplication copies originated from the same ancestral DNA sequence. Based on this fact, we have developed a computational scheme for inference of point mutational events in human segmental duplications, which we collectively term duplication-inferred mutations (DIMs). We have characterized these nucleotide substitutions by comparing them with high-quality SNPs from dbSNP, both in terms of sequence context and frequency of substitution types.
Overall, DIMs show a lower ratio of transitions relative to transversions than SNPs, although this ratio approaches that of SNPs when considering DIMs within most recent duplications. Our findings indicate that DIMs and SNPs in general are caused by similar mutational mechanisms, with some deviances at the CpG dinucleotide. Furthermore, we discover a large number of reference SNPs that coincide with computationally inferred DIMs. The latter reflects how sequence variation in duplicated sequences can be misinterpreted as ordinary allelic variation.
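The transition/transversion ratio mentioned above can be computed as in the following sketch. It assumes substitutions are given as pairs of single bases; the input list is made up for illustration and is not data from the study.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(base_1, base_2):
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    pair = {base_1, base_2}
    return base_1 != base_2 and (pair <= PURINES or pair <= PYRIMIDINES)


def ti_tv_ratio(substitutions):
    """Ratio of transitions to transversions in a list of (base, base) pairs."""
    ti = sum(1 for a, b in substitutions if is_transition(a, b))
    tv = sum(1 for a, b in substitutions if a != b and not is_transition(a, b))
    return ti / tv if tv else float("inf")


# Hypothetical substitutions: 3 transitions and 2 transversions.
example = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("G", "T")]
ratio = ti_tv_ratio(example)
```

Because the classification depends only on the unordered pair of bases, the same function applies whether or not the direction of a substitution is known.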
In summary, we show how DNA sequence analysis of segmental duplications can provide a genome-wide mutational spectrum that mirrors recent genome evolution. The inferred set of nucleotide substitutions represents a valuable complement to SNPs for the analysis of genetic variation and point mutagenesis.
PMCID: PMC2640414  PMID: 19161616
4.  Validation of oligoarrays for quantitative exploration of the transcriptome 
BMC Genomics  2008;9:258.
Oligoarrays have become an accessible technique for exploring the transcriptome, but it is presently unclear how absolute transcript data from this technique compare to the data achieved with tag-based quantitative techniques, such as massively parallel signature sequencing (MPSS) and serial analysis of gene expression (SAGE). By use of the TransCount method we calculated absolute transcript concentrations from spotted oligoarray intensities, enabling direct comparisons with tag counts obtained with MPSS and SAGE. The tag counts were converted to number of transcripts per cell by assuming that the sum of all transcripts in a single cell was 5·10^5. Our aim was to investigate whether the less resource demanding and more widespread oligoarray technique could provide data that were correlated to and had the same absolute scale as those obtained with MPSS and SAGE.
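The conversion from tag counts to transcripts per cell can be sketched as a proportional rescaling. The abstract states only the assumed total of 5·10^5 (500,000) transcripts per cell; the proportional form, the function name and the toy counts below are assumptions made for illustration.

```python
ASSUMED_TRANSCRIPTS_PER_CELL = 5e5  # total assumed in the abstract


def tags_to_transcripts_per_cell(tag_counts, total_per_cell=ASSUMED_TRANSCRIPTS_PER_CELL):
    """Rescale tag counts so that they sum to the assumed total per cell.

    Assumption: each transcript's share of all tags equals its share of
    all transcripts in the cell.
    """
    total_tags = sum(tag_counts.values())
    return {
        gene: count / total_tags * total_per_cell
        for gene, count in tag_counts.items()
    }


# Hypothetical tag counts for three genes.
tag_counts = {"gene_a": 10, "gene_b": 30, "gene_c": 60}
per_cell = tags_to_transcripts_per_cell(tag_counts)
```

Under this scaling every estimate is directly proportional to the assumed total, so a larger true total would raise all tag-based values by the same factor.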
A total of 1,777 unique transcripts were detected in common by the three technologies and served as the basis for our analyses. The correlations involving the oligoarray data were not weaker than, but similar to, the correlation between the MPSS and SAGE data, both when the entire concentration range was considered and at high concentrations. The data sets were more strongly correlated at high transcript concentrations than at low concentrations. On an absolute scale, the number of transcripts per cell and gene was generally higher based on oligoarrays than on MPSS and SAGE, and ranged from 1.6 to 9,705 for the 1,777 overlapping genes. The MPSS data were on the same scale as the SAGE data, ranging from 0.5 to 3,180 (MPSS) and 9 to 1,268 (SAGE) transcripts per cell and gene. The sum of all transcripts per cell for these genes was 3.8·10^5 (oligoarrays), 1.1·10^5 (MPSS) and 7.6·10^4 (SAGE), whereas the corresponding sum for all detected transcripts was 1.1·10^6 (oligoarrays), 2.8·10^5 (MPSS) and 3.8·10^5 (SAGE).
The oligoarrays and TransCount provide quantitative transcript concentrations that are correlated to MPSS and SAGE data, but the absolute scale of the measurements differs across the technologies. The discrepancy raises the question of whether the sum of all transcripts within a single cell might be higher than the number of 5·10^5 suggested in the literature and used to convert tag counts to transcripts per cell. If so, this may explain the apparently higher transcript detection efficiency of the oligoarrays, and has to be clarified before absolute transcript concentrations can be interchanged across the technologies. The ability to obtain transcript concentrations from oligoarrays opens up the possibility of efficient generation of universal transcript databases with low resource demands.
PMCID: PMC2430212  PMID: 18513391
5.  Mapping of oxidative stress responses of human tumor cells following photodynamic therapy using hexaminolevulinate 
BMC Genomics  2007;8:273.
Photodynamic therapy (PDT) involves systemic or topical administration of a lesion-localizing photosensitizer or its precursor, followed by irradiation with visible light to cause singlet oxygen-induced damage to the affected tissue. A number of mechanisms seem to be involved in the protective responses to PDT, including activation of transcription factors, heat shock proteins, antioxidant enzymes and apoptotic pathways.
In this study, we address the effects of a destructive/lethal hexaminolevulinate (HAL) mediated PDT dose on the transcriptome by using transcriptional exon evidence oligo microarrays. Here, we confirm deviations in the steady state expression levels of previously identified early defence response genes and extend this to include unreported PDT inducible gene groups, most notably the metallothioneins and histones. HAL-PDT mediated stress also altered expression of genes encoded by mitochondrial DNA (mtDNA). Further, we report PDT stress induced alternative splicing. Specifically, the ATF3 alternative isoform (deltaZip2) was up-regulated, while the full-length variant was not changed by the treatment. Results were independently verified by two different technological microarray platforms. Good microarray, RT-PCR and Western immunoblotting correlation for selected genes support these findings.
Here, we report new insights into how destructive/lethal PDT alters the transcriptome not only at the transcriptional level but also at post-transcriptional level via alternative splicing.
PMCID: PMC2045114  PMID: 17692132
6.  Comparison of hybridization-based and sequencing-based gene expression technologies on biological replicates 
BMC Genomics  2007;8:153.
High-throughput systems for gene expression profiling have been developed and have matured rapidly through the past decade. Broadly, these can be divided into two categories: hybridization-based and sequencing-based approaches. With data from different technologies being accumulated, concerns and challenges are raised about the level of agreement across technologies. As part of an ongoing large-scale cross-platform data comparison framework, we report here a comparison based on identical samples between one-dye DNA microarray platforms and MPSS (Massively Parallel Signature Sequencing).
The DNA microarray platforms generally provided highly correlated data, while moderate correlations between microarrays and MPSS were obtained. Disagreements between the two types of technologies can be attributed to limitations inherent to both technologies. The variation found between pooled biological replicates underlines the importance of exercising caution in identification of differential expression, especially for the purposes of biomarker discovery.
Based on different principles, hybridization-based and sequencing-based technologies should be considered complementary to each other, rather than competitive alternatives for measuring gene expression, and currently, both are important tools for transcriptome profiling.
PMCID: PMC1899500  PMID: 17555589
7.  Limitations of mRNA amplification from small-size cell samples 
BMC Genomics  2005;6:147.
Global mRNA amplification has become a widely used approach to obtain gene expression profiles from limited material. An important concern is the reliable reflection of the starting material in the results obtained. This is especially important with extremely low quantities of input RNA where stochastic effects due to template dilution may be present. This aspect remains under-documented in the literature, as quantitative measures of data reliability are most often lacking. To address this issue, we examined the sensitivity levels of each transcript in 3 different cell sample sizes. ANOVA analysis was used to estimate the overall effects of reduced input RNA in our experimental design. In order to estimate the validity of decreasing sample sizes, we examined the sensitivity levels of each transcript by applying a novel model-based method, TransCount.
From expression data, TransCount provided estimates of absolute transcript concentrations in each examined sample. The results from TransCount were used to calculate the Pearson correlation coefficient between transcript concentrations for different sample sizes. The correlations were clearly transcript copy number dependent. A critical level was observed where stochastic fluctuations became significant. The analysis allowed us to pinpoint the gene-specific number of transcript templates that defined the limit of reliability with respect to number of cells from that particular source. In the sample amplified from 1000 cells, transcripts expressed with at least 121 transcripts/cell were statistically reliable, and for 250 cells, the limit was 1806 transcripts/cell. Above these thresholds, correlation between our data sets was at acceptable values for reliable interpretation.
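A copy-number-dependent correlation of this kind can be explored with a sketch like the one below. It computes the Pearson coefficient only for transcripts at or above a chosen concentration in both samples. The thresholding scheme, function names and concentrations are hypothetical; this is not the TransCount procedure itself.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_above(conc_a, conc_b, threshold):
    """Pearson correlation restricted to transcripts >= threshold in both samples."""
    pairs = [(a, b) for a, b in zip(conc_a, conc_b) if a >= threshold and b >= threshold]
    xs, ys = zip(*pairs)
    return pearson(xs, ys)


# Hypothetical transcripts/cell for the same genes in two sample sizes.
sample_large = [5, 12, 100, 200, 400, 800]
sample_small = [30, 2, 110, 190, 420, 780]
r_high = correlation_above(sample_large, sample_small, threshold=100)
```

Repeating the calculation over a range of thresholds would show where the correlation stops being acceptable, which is the idea behind the reliability limits quoted above.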
These results imply that the reliability of any amplification experiment must be validated empirically to justify that any gene exists in sufficient quantity in the input material. This finding has important implications for any experiment where only extremely small samples such as single cell analyses or laser captured microdissected cells are available.
PMCID: PMC1310617  PMID: 16253144
8.  Profound influence of microarray scanner characteristics on gene expression ratios: analysis and procedure for correction 
BMC Genomics  2004;5:10.
High throughput gene expression data from spotted cDNA microarrays are collected by scanning the signal intensities of the corresponding spots by dedicated fluorescence scanners. The major scanner settings for increasing the spot intensities are the laser power and the voltage of the photomultiplier tube (PMT). It is required that the expression ratios are independent of these settings. We have investigated the relationships between PMT voltage, spot intensities, and expression ratios for different scanners, in order to define an optimal scanning procedure.
All scanners showed a limited intensity range from 200 to 50 000 (mean spot intensity), for which the expression ratios were independent of PMT voltage. This usable intensity range was considerably less than the maximum detection range of the PMTs. The use of spot and background intensities outside this range led to errors in the ratios. The errors at high intensities were caused by saturation of pixel intensities within the spots. An algorithm was developed to correct the intensities of these spots, and, hence, extend the upper limit of the usable intensity range.
It is suggested that the PMT voltage should be increased to avoid intensities of the weakest spots below the usable range, allowing the brightest spots to reach the level of saturation. Subsequently, a second set of images should be acquired with a lower PMT setting such that no pixels are in saturation. Reliable data for spots with saturation in the first set of images can easily be extracted from the second set of images by the use of our algorithm. This procedure would lead to an increase in the accuracy of the data and in the number of data points achieved in each experiment compared to traditional procedures.
PMCID: PMC356910  PMID: 15018648
9.  Effects of mRNA amplification on gene expression ratios in cDNA experiments estimated by analysis of variance 
BMC Genomics  2003;4:11.
A limiting factor of cDNA microarray technology is the need for a substantial amount of RNA per labeling reaction. Thus, 20–200 micrograms total RNA or 0.5–2 micrograms poly(A) RNA is typically required for monitoring gene expression. In addition, gene expression profiles from large, heterogeneous cell populations provide complex patterns from which biological data for the target cells may be difficult to extract. In this study, we chose to investigate a widely used mRNA amplification protocol that allows gene expression studies to be performed on samples with limited starting material. We present a quantitative study of the variation and noise present in our data set obtained from experiments with either amplified or non-amplified material.
Using analysis of variance (ANOVA) and multiple hypothesis testing, we estimated the impact of amplification on the preservation of gene expression ratios. Both methods showed that the gene expression ratios were not completely preserved between amplified and non-amplified material. We also compared the expression ratios between the two cell lines for the amplified material with expression ratios between the two cell lines for the non-amplified material for each gene. With the aid of multiple t-testing with a false discovery rate of 5%, we found that 10% of the genes investigated showed significantly different expression ratios.
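Controlling the false discovery rate at 5% across many per-gene tests can be sketched with the Benjamini-Hochberg step-up procedure. The abstract does not name the FDR method used, so the choice of Benjamini-Hochberg is an assumption, and the p-values are invented for illustration.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return one flag per p-value: True if significant at the given FDR.

    Step-up rule: find the largest rank k with p_(k) <= k / m * fdr and
    reject the hypotheses with the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            largest_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= largest_rank:
            rejected[idx] = True
    return rejected


# Hypothetical p-values from per-gene t-tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.27, 0.6]
flags = benjamini_hochberg(p_values)
```

In this toy example only the two smallest p-values pass, even though four of them are below 0.05 when taken individually.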
Although the ratios were not fully preserved, amplification may prove to be extremely useful with respect to characterizing low expressing genes.
PMCID: PMC153514  PMID: 12659661
mRNA amplification; microarray; gene expression; multiple hypothesis testing; linear mixed effects model

