Results 1-5 (5)

1.  An Extensive Evaluation of Read Trimming Effects on Illumina NGS Data Analysis 
PLoS ONE  2013;8(12):e85024.
Next Generation Sequencing is having an extremely strong impact in biological and medical research and diagnostics, with applications ranging from gene expression quantification to genotyping and genome reconstruction. Sequencing data is often provided as raw reads which are processed prior to analysis. One of the most used preprocessing procedures is read trimming, which aims at removing low quality portions while preserving the longest high quality part of an NGS read. In the current work, we evaluate nine different trimming algorithms in four datasets and three common NGS-based applications (RNA-Seq, SNP calling and genome assembly). Trimming is shown to increase the quality and reliability of the analysis, with concurrent gains in terms of execution time and computational resources needed.
PMCID: PMC3871669  PMID: 24376861
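The idea of read trimming described in this abstract can be illustrated with a minimal sketch. It assumes Phred+33 quality encoding and a fixed per-base threshold, and it simply keeps the longest contiguous run of bases at or above that threshold. The function names, the threshold of 20, and the example read are illustrative assumptions; this is not one of the nine algorithms evaluated in the paper.

```python
def phred_scores(quality, offset=33):
    """Decode an ASCII quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


def trim_read(sequence, quality, min_quality=20, offset=33):
    """Keep the longest contiguous stretch whose bases all reach min_quality.

    Hypothetical helper for illustration only; real trimmers typically use
    sliding windows or running-sum criteria rather than a hard per-base cutoff.
    """
    scores = phred_scores(quality, offset)
    best_start, best_len = 0, 0
    run_start = None
    # A sentinel score of -1 closes a run that extends to the end of the read.
    for i, score in enumerate(scores + [-1]):
        if score >= min_quality:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if i - run_start > best_len:
                best_start, best_len = run_start, i - run_start
            run_start = None
    end = best_start + best_len
    return sequence[best_start:end], quality[best_start:end]


# '#' decodes to Phred 2 and 'I' to Phred 40 under Phred+33.
trimmed_seq, trimmed_qual = trim_read("ACGTACGTAC", "##IIIII#II")
print(trimmed_seq, trimmed_qual)
```

A read with no base reaching the threshold comes back as a pair of empty strings, which a downstream step would normally discard.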
2.  Transcriptome sequencing and microarray design for functional genomics in the extremophile Arabidopsis relative Thellungiella salsuginea (Eutrema salsugineum) 
BMC Genomics  2013;14:793.
Most molecular studies of plant stress tolerance have been performed with Arabidopsis thaliana, although it is not particularly stress tolerant and may lack protective mechanisms required to survive extreme environmental conditions. Thellungiella salsuginea has attracted interest as an alternative plant model species with high tolerance of various abiotic stresses. While the T. salsuginea genome has recently been sequenced, its annotation is still incomplete and transcriptomic information is scarce. In addition, functional genomics investigations in this species are severely hampered by a lack of affordable tools for genome-wide gene expression studies.
Here, we report the results of Thellungiella de novo transcriptome assembly and annotation based on 454 pyrosequencing and development and validation of a T. salsuginea microarray. ESTs were generated from a non-normalized and a normalized library synthesized from RNA pooled from samples covering different tissues and abiotic stress conditions. Both libraries yielded partially unique sequences, indicating their necessity to obtain comprehensive transcriptome coverage. More than 1 million sequence reads were assembled into 42,810 unigenes, approximately 50% of which could be functionally annotated. These unigenes were compared to all available Thellungiella genome sequence information. In addition, the groups of Late Embryogenesis Abundant (LEA) proteins, Mitogen Activated Protein (MAP) kinases and protein phosphatases were annotated in detail. We also predicted the target genes for 384 putative miRNAs. From the sequence information, we constructed a 44 k Agilent oligonucleotide microarray. Comparison of same-species and cross-species hybridization results showed superior performance of the newly designed array for T. salsuginea samples. The developed microarrays were used to investigate transcriptional responses of T. salsuginea and Arabidopsis during cold acclimation using the MapMan software.
This study provides the first comprehensive transcriptome information for the extremophile Arabidopsis relative T. salsuginea. The data constitute a more than three-fold increase in the number of publicly available unigene sequences and will greatly facilitate genome annotation. In addition, we have designed and validated the first genome-wide microarray for T. salsuginea, which will be commercially available. Together with the publicly available MapMan software this will become an important tool for functional genomics of plant stress tolerance.
PMCID: PMC3832907  PMID: 24228715
Arabidopsis thaliana; Cold acclimation; Gene annotation; LEA proteins; MAP kinases; Microarray design; microRNAs; Protein phosphatases; Thellungiella salsuginea; Transcriptome sequencing
3.  SLocX: Predicting Subcellular Localization of Arabidopsis Proteins Leveraging Gene Expression Data 
Despite the growing volume of experimentally validated knowledge about the subcellular localization of plant proteins, a well performing in silico prediction tool is still a necessity. Existing tools, which employ information derived from protein sequence alone, offer limited accuracy and/or rely on full sequence availability. We explored whether gene expression profiling data can be harnessed to enhance prediction performance. To achieve this, we trained several support vector machines to predict the subcellular localization of Arabidopsis thaliana proteins using sequence derived information, expression behavior, or a combination of these data and compared their predictive performance through a cross-validation test. We show that gene expression carries information about the subcellular localization not available in sequence information, yielding dramatic benefits for plastid localization prediction, and some notable improvements for other compartments such as the mitochondrion, the Golgi, and the plasma membrane. Based on these results, we constructed a novel subcellular localization prediction engine, SLocX, combining gene expression profiling data with protein sequence-based information. We then validated the results of this engine using an independent test set of annotated proteins and a transient expression of GFP fusion proteins. Here, we present the prediction framework and a website of predicted localizations for Arabidopsis. The relatively good accuracy of our prediction engine, even in cases where only partial protein sequence is available (e.g., in sequences lacking the N-terminal region), offers a promising opportunity for similar application to non-sequenced or poorly annotated plant species. Although the prediction scope of our method is currently limited by the availability of expression information on the ATH1 array, we believe that advances in gene expression measurement technology will make our method applicable to all Arabidopsis proteins.
PMCID: PMC3355584  PMID: 22639594
subcellular localization; support vector machine; prediction; gene expression
4.  Genomic and transcriptomic analysis of the AP2/ERF superfamily in Vitis vinifera 
BMC Genomics  2010;11:719.
The AP2/ERF protein family contains transcription factors that play a crucial role in plant growth and development and in response to biotic and abiotic stress conditions in plants. Grapevine (Vitis vinifera) is the only woody crop whose genome has been fully sequenced. So far, no detailed expression profile of AP2/ERF-like genes is available for grapevine.
An exhaustive search for AP2/ERF genes was carried out on the Vitis vinifera genome and their expression profile was analyzed by Real-Time quantitative PCR (qRT-PCR) in different vegetative and reproductive tissues and under two different ripening stages.
One hundred and forty nine sequences, containing at least one ERF domain, were identified. Specific clusters within the AP2 and ERF families showed conserved expression patterns reminiscent of other species and grapevine specific trends related to berry ripening. Moreover, putative targets of group IX ERFs were identified by co-expression and protein similarity comparisons.
The grapevine genome contains a number of AP2/ERF genes comparable to that of other dicot species analyzed so far. We observed an increase in the size of specific groups within the ERF family, probably due to recent duplication events. Expression analyses in different aerial tissues display common features previously described in other plant systems and introduce possible new roles for members of some ERF groups during fruit ripening. The presented analysis of AP2/ERF genes in grapevine provides the basis for studying the molecular regulation of berry development and the ripening process.
PMCID: PMC3022922  PMID: 21171999
5.  Algorithm-driven Artifacts in median polish summarization of Microarray data 
BMC Bioinformatics  2010;11:553.
High-throughput measurement of transcript intensities using Affymetrix type oligonucleotide microarrays has produced a massive quantity of data during the last decade. Different preprocessing techniques exist to convert the raw signal intensities measured by these chips into gene expression estimates. Although these techniques have been widely benchmarked in the context of differential gene expression analysis, there are only a few examples where their performance has been assessed with respect to coexpression-based studies such as sample classification.
In the present paper we benchmark the three most used normalization procedures (MAS5, RMA and GCRMA) in the context of inter-array correlation analysis, confirming and extending the finding that RMA and GCRMA consistently overestimate sample similarity upon normalization. We determine that median polish summarization is responsible for generating a large proportion of these over-similarity artifacts. Furthermore, we show that the most affected probesets also show internal signal disagreement, and tend to be composed of individual probes hitting different gene transcripts. We finally provide a correction to the RMA/GCRMA summarization procedure that massively reduces inter-array correlation artifacts, without affecting the detection of differentially expressed genes.
We propose tRMA as a modification of RMA to normalize microarray experiments for correlation-based analysis.
PMCID: PMC2998528  PMID: 21070630
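For readers unfamiliar with the median polish summarization discussed in this abstract, the following is a minimal sketch of Tukey's classic median polish applied to a probes-by-arrays matrix of log intensities. It assumes rows are probes and columns are arrays, and that the per-array expression estimate is the overall term plus the column effect. The function name and the toy matrix are illustrative assumptions, and the sketch does not implement the tRMA correction proposed by the authors.

```python
from statistics import median


def median_polish(matrix, max_iter=10, tol=1e-6):
    """Fit matrix[i][j] ~ overall + row_eff[i] + col_eff[j] by alternating medians.

    Illustrative helper: rows are assumed to be probes, columns arrays.
    Returns (overall, row_eff, col_eff, residuals).
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    resid = [list(row) for row in matrix]
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(max_iter):
        # Row sweep: move each row's median out of the residuals.
        row_meds = [median(row) for row in resid]
        for i in range(n_rows):
            resid[i] = [value - row_meds[i] for value in resid[i]]
            row_eff[i] += row_meds[i]
        shift = median(col_eff)
        col_eff = [c - shift for c in col_eff]
        overall += shift
        # Column sweep: move each column's median out of the residuals.
        col_meds = [median([resid[i][j] for i in range(n_rows)])
                    for j in range(n_cols)]
        for j in range(n_cols):
            for i in range(n_rows):
                resid[i][j] -= col_meds[j]
            col_eff[j] += col_meds[j]
        shift = median(row_eff)
        row_eff = [r - shift for r in row_eff]
        overall += shift
        # Stop once a full sweep no longer changes the residuals.
        if max(abs(m) for m in row_meds + col_meds) < tol:
            break
    return overall, row_eff, col_eff, resid


# Toy probeset: 3 probes (rows) measured on 3 arrays (columns), log scale.
toy = [[7, 9, 11],
       [8, 10, 12],
       [9, 11, 13]]
overall, probe_eff, array_eff, residuals = median_polish(toy)
expression = [overall + a for a in array_eff]  # one summary value per array
print(expression)
```

Because the toy matrix is exactly additive, the residuals vanish; with real probe data the residuals absorb whatever signal the additive model cannot explain.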
