
Results 1-11 (11)

1.  WISECONDOR: detection of fetal aberrations from shallow sequencing maternal plasma based on a within-sample comparison scheme 
Nucleic Acids Research  2013;42(5):e31.
Genetic disorders can be detected by prenatal diagnosis using Chorionic Villus Sampling, but the 1:100 chance to result in miscarriage restricts the use to fetuses that are suspected to have an aberration. Detection of trisomy 21 cases noninvasively is now possible owing to the upswing of next-generation sequencing (NGS) because a small percentage of fetal DNA is present in maternal plasma. However, detecting other trisomies and smaller aberrations can only be realized using high-coverage NGS, making it too expensive for routine practice. We present a method, WISECONDOR (WIthin-SamplE COpy Number aberration DetectOR), which detects small aberrations using low-coverage NGS. The increased detection resolution was achieved by comparing read counts within the tested sample of each genomic region with regions on other chromosomes that behave similarly in control samples. This within-sample comparison avoids the need to re-sequence control samples. WISECONDOR correctly identified all T13, T18 and T21 cases while coverages were as low as 0.15–1.66. No false positives were identified. Moreover, WISECONDOR also identified smaller aberrations, down to 20 Mb, such as del(13)(q12.3q14.3), +i(12)(p10) and i(18)(q10). This shows that prevalent fetal copy number aberrations can be detected accurately and affordably by shallow sequencing maternal plasma. WISECONDOR is available at
PMCID: PMC3950725  PMID: 24170809
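The within-sample comparison described in this abstract can be illustrated with a minimal sketch. It is not the WISECONDOR implementation: the function names, the squared-distance similarity measure, the number of reference bins and the z-score are all assumptions made for illustration. Only the idea of comparing a region with regions on other chromosomes that behave similarly in control samples comes from the abstract.

```python
import statistics


def pick_reference_bins(control_profiles, target, k):
    """Pick k bins on other chromosomes that behave like `target` in controls.

    control_profiles maps (chromosome, bin_index) to a list of normalized
    read counts, one per control sample (hypothetical data layout).
    """
    target_profile = control_profiles[target]
    # Only bins on other chromosomes qualify as references.
    candidates = [b for b in control_profiles if b[0] != target[0]]
    # Assumed similarity measure: squared distance across control samples.
    candidates.sort(
        key=lambda b: sum(
            (u - v) ** 2 for u, v in zip(control_profiles[b], target_profile)
        )
    )
    return candidates[:k]


def within_sample_z(sample_counts, target, reference_bins):
    """Z-score of the target bin against its reference bins in the SAME sample."""
    ref = [sample_counts[b] for b in reference_bins]
    return (sample_counts[target] - statistics.mean(ref)) / statistics.stdev(ref)


if __name__ == "__main__":
    controls = {
        ("chr21", 0): [1.00, 1.10, 0.90],
        ("chr21", 1): [1.00, 1.10, 0.90],  # same chromosome, never a reference
        ("chr1", 0): [1.00, 1.10, 0.90],
        ("chr2", 0): [1.05, 1.10, 0.95],
        ("chr3", 0): [0.90, 1.10, 0.90],
        ("chr4", 0): [2.00, 2.00, 2.00],
    }
    refs = pick_reference_bins(controls, ("chr21", 0), k=3)
    tested = {("chr21", 0): 1.30, ("chr1", 0): 1.00,
              ("chr2", 0): 1.02, ("chr3", 0): 0.98}
    print(refs, within_sample_z(tested, ("chr21", 0), refs))
```

Because the reference bins are read from the tested sample itself, the control samples are needed only once, to decide which bins serve as references.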
2.  Exploring variation-aware contig graphs for (comparative) metagenomics using MaryGold 
Bioinformatics  2013;29(22):2826-2834.
Motivation: Although many tools are available to study variation and its impact in single genomes, there is a lack of algorithms for finding such variation in metagenomes. This hampers the interpretation of metagenomics sequencing datasets, which are increasingly acquired in research on the (human) microbiome, in environmental studies and in the study of processes in the production of foods and beverages. Existing algorithms often depend on the use of reference genomes, which pose a problem when a metagenome of a priori unknown strain composition is studied. In this article, we develop a method to perform reference-free detection and visual exploration of genomic variation, both within a single metagenome and between metagenomes.
Results: We present the MaryGold algorithm and its implementation, which efficiently detects bubble structures in contig graphs using graph decomposition. These bubbles represent variable genomic regions in closely related strains in metagenomic samples. The variation found is presented in a condensed Circos-based visualization, which allows for easy exploration and interpretation of the found variation.
We validated the algorithm on two simulated datasets containing three and seven Escherichia coli genomes, respectively, and showed that finding allelic variation in these genomes improves assemblies. Additionally, we applied MaryGold to publicly available real metagenomic datasets, enabling us to find within-sample genomic variation in the metagenomes of a kimchi fermentation process, the microbiome of a premature infant and in microbial communities living on acid mine drainage. Moreover, we used MaryGold for between-sample variation detection and exploration by comparing sequencing data sampled at different time points for both of these datasets.
Availability: MaryGold has been written in C++ and Python and can be downloaded from
PMCID: PMC3916741  PMID: 24058058
3.  Analysis of Tumor Heterogeneity and Cancer Gene Networks Using Deep Sequencing of MMTV-Induced Mouse Mammary Tumors 
PLoS ONE  2013;8(5):e62113.
Cancer develops through a multistep process in which normal cells progress to malignant tumors via the evolution of their genomes as a result of the acquisition of mutations in cancer driver genes. The number, identity and mode of action of cancer driver genes, and how they contribute to tumor evolution is largely unknown. This study deployed the Mouse Mammary Tumor Virus (MMTV) as an insertional mutagen to find both the driver genes and the networks in which they function. Using deep insertion site sequencing we identified around 31000 retroviral integration sites in 604 MMTV-induced mammary tumors from mice with mammary gland-specific deletion of Trp53, Pten heterozygous knockout mice, or wildtype strains. We identified 18 known common integration sites (CISs) and 12 previously unknown CISs marking new candidate cancer genes. Members of the Wnt, Fgf, Fgfr, Rspo and Pdgfr gene families were commonly mutated in a mutually exclusive fashion. The sequence data we generated also yielded information on the clonality of insertions in individual tumors, allowing us to develop a data-driven model of MMTV-induced tumor development. Insertional mutations near Wnt and Fgf genes mark the earliest “initiating” events in MMTV-induced tumorigenesis, whereas Fgfr genes are targeted later during tumor progression. Our data show that insertional mutagenesis can be used to discover the mutational networks, the timing of mutations, and the genes that initiate and drive tumor evolution.
PMCID: PMC3653918  PMID: 23690930
4.  Exploring Sequence Characteristics Related to High-Level Production of Secreted Proteins in Aspergillus niger 
PLoS ONE  2012;7(10):e45869.
Protein sequence features are explored in relation to the production of over-expressed extracellular proteins by fungi. Knowledge on features influencing protein production and secretion could be employed to improve enzyme production levels in industrial bioprocesses via protein engineering. A large set, over 600 homologous and nearly 2,000 heterologous fungal genes, were overexpressed in Aspergillus niger using a standardized expression cassette and scored for high versus no production. Subsequently, sequence-based machine learning techniques were applied for identifying relevant DNA and protein sequence features. The amino-acid composition of the protein sequence was found to be most predictive and interpretation revealed that, for both homologous and heterologous gene expression, the same features are important: tyrosine and asparagine composition was found to have a positive correlation with high-level production, whereas for unsuccessful production, contributions were found for methionine and lysine composition. The predictor is available online at Subsequent work aims at validating these findings by protein engineering as a method for increasing expression levels per gene copy.
PMCID: PMC3462195  PMID: 23049690
5.  Integration of Clinical and Gene Expression Data Has a Synergetic Effect on Predicting Breast Cancer Outcome 
PLoS ONE  2012;7(7):e40358.
Breast cancer outcome can be predicted using models derived from gene expression data or clinical data. Only a few studies have created a single prediction model using both gene expression and clinical data. These studies often remain inconclusive regarding an obtained improvement in prediction performance. We rigorously compare three different integration strategies (early, intermediate, and late integration) as well as classifiers employing no integration (only one data type) using five classifiers of varying complexity. We perform our analysis on a set of 295 breast cancer samples, for which gene expression data and an extensive set of clinical parameters are available as well as four breast cancer datasets containing 521 samples that we used as independent validation. On the 295 samples, a nearest mean classifier employing a logical OR operation (late integration) on clinical and expression classifiers significantly outperforms all other classifiers. Moreover, regardless of the integration strategy, the nearest mean classifier achieves the best performance. All five classifiers achieve their best performance when integrating clinical and expression data. Repeating the experiments using the 521 samples from the four independent validation datasets also indicated a significant performance improvement when integrating clinical and gene expression data. Whether integration also improves performances on other datasets (e.g. other tumor types) has not been investigated, but seems worthwhile pursuing. Our work suggests that future models for predicting breast cancer outcome should exploit both data types by employing a late OR or intermediate integration strategy based on nearest mean classifiers.
PMCID: PMC3394805  PMID: 22808140
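The late OR integration named in this abstract can be sketched as follows: one nearest mean classifier is trained per data type and the two predictions are combined with a logical OR. This is only an illustrative sketch. The function names, the Euclidean distance, the toy data and the choice of label 1 as the class that the OR operates on are assumptions, not details taken from the study.

```python
import math


def fit_nearest_mean(samples, labels):
    """Compute one mean vector per class label."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict_nearest_mean(means, x):
    """Assign x to the class whose mean is closest (Euclidean distance)."""
    return min(means, key=lambda y: math.dist(means[y], x))


def predict_late_or(clinical_means, expression_means, clinical_x, expression_x):
    """Late integration: predict 1 if EITHER single-type classifier predicts 1."""
    a = predict_nearest_mean(clinical_means, clinical_x)
    b = predict_nearest_mean(expression_means, expression_x)
    return int(a == 1 or b == 1)


if __name__ == "__main__":
    labels = [0, 0, 1, 1]  # assumed coding: 1 = poor outcome
    clinical = fit_nearest_mean([[0.0], [0.2], [1.0], [1.2]], labels)
    expression = fit_nearest_mean(
        [[0.0, 0.0], [0.1, 0.1], [1.0, 1.0], [0.9, 1.1]], labels
    )
    # The clinical classifier says 0, the expression classifier says 1.
    print(predict_late_or(clinical, expression, [0.1], [1.0, 1.0]))
```

In early integration the two feature vectors would instead be concatenated before a single classifier is trained; the late variant keeps the classifiers separate and merges only their decisions.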
6.  Understanding Regulation of Metabolism through Feasibility Analysis 
PLoS ONE  2012;7(7):e39396.
Understanding cellular regulation of metabolism is a major challenge in systems biology. Thus far, the main assumption was that enzyme levels are key regulators in metabolic networks. However, regulation analysis recently showed that metabolism is rarely controlled via enzyme levels only, but through non-obvious combinations of hierarchical (gene and enzyme levels) and metabolic regulation (mass action and allosteric interaction). Quantitative analyses relating changes in metabolic fluxes to changes in transcript or protein levels have revealed a remarkable lack of understanding of the regulation of these networks. We study metabolic regulation via feasibility analysis (FA). Inspired by the constraint-based approach of Flux Balance Analysis, FA incorporates a model describing kinetic interactions between molecules. We enlarge the portfolio of objectives for the cell by defining three main physiologically relevant objectives for the cell: function, robustness and temporal responsiveness. We postulate that the cell assumes one or a combination of these objectives and search for enzyme levels necessary to achieve this. We call the subspace of feasible enzyme levels the feasible enzyme space. Once this space is constructed, we can study how different objectives may (if possible) be combined, or evaluate the conditions at which the cells are faced with a trade-off among those. We apply FA to the experimental scenario of long-term carbon limited chemostat cultivation of yeast cells, studying how metabolism evolves optimally. Cells employ a mixed strategy composed of increasing enzyme levels for glucose uptake and hexokinase and decreasing levels of the remaining enzymes. This trade-off renders the cells specialized in this low-carbon flux state to compete for the available glucose and get rid of overcapacity. Overall, we show that FA is a powerful tool for systems biologists to study regulation of metabolism, interpret experimental data and evaluate hypotheses.
PMCID: PMC3392259  PMID: 22808034
7.  An Evaluation Protocol for Subtype-Specific Breast Cancer Event Prediction 
PLoS ONE  2011;6(7):e21681.
In recent years increasing evidence has appeared that breast cancer may not constitute a single disease at the molecular level, but comprises a heterogeneous set of subtypes. This suggests that instead of building a single monolithic predictor, better predictors might be constructed that solely target samples of a designated subtype, which are believed to represent more homogeneous sets of samples. An unavoidable drawback of developing subtype-specific predictors, however, is that a stratification by subtype drastically reduces the number of samples available for their construction. As numerous studies have indicated sample size to be an important factor in predictor construction, it is therefore questionable whether the potential benefit of subtyping can outweigh the drawback of a severe loss in sample size. Factors like unequal class distributions and differences in the number of samples per subtype further complicate comparisons. We present a novel experimental protocol that facilitates a comprehensive comparison between subtype-specific predictors and predictors that do not take subtype information into account. Emphasis lies on careful control of sample size as well as class and subtype distributions. The methodology is applied to a large breast cancer compendium involving over 1500 arrays, using a state-of-the-art subtyping scheme. We show that the resulting subtype-specific predictors outperform those that do not take subtype information into account, especially when taking sample size considerations into account.
PMCID: PMC3132736  PMID: 21760900
8.  Fewer permutations, more accurate P-values 
Bioinformatics  2009;25(12):i161-i168.
Motivation: Permutation tests have become a standard tool to assess the statistical significance of an event under investigation. The statistical significance, as expressed in a P-value, is calculated as the fraction of permutation values that are at least as extreme as the original statistic, which was derived from non-permuted data. This empirical method directly couples both the minimal obtainable P-value and the resolution of the P-value to the number of permutations. Thereby, it imposes upon itself the need for a very large number of permutations when small P-values are to be accurately estimated. This is computationally expensive and often infeasible.
Results: A method of computing P-values based on tail approximation is presented. The tail of the distribution of permutation values is approximated by a generalized Pareto distribution. A good fit and thus accurate P-value estimates can be obtained with a drastically reduced number of permutations when compared with the standard empirical way of computing P-values.
Availability: The Matlab code can be obtained from the corresponding author on request.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2687965  PMID: 19477983
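The two ways of computing a permutation P-value contrasted in this abstract can be sketched as follows. The empirical estimate follows the definition given in the abstract. The tail approximation is only a rough sketch under stated assumptions: the abstract does not say how the generalized Pareto distribution (GPD) is fitted, so the method-of-moments fit, the threshold choice, the default of 250 exceedances and all function names below are hypothetical, and the authors' Matlab code may differ.

```python
import math
import random
import statistics


def empirical_p(observed, perm_values):
    """Fraction of permutation values at least as extreme as the observed one."""
    hits = sum(1 for v in perm_values if v >= observed)
    return hits / len(perm_values)


def gpd_tail_p(observed, perm_values, n_exceed=250):
    """Estimate a P-value from a GPD fitted to the largest n_exceed values."""
    ordered = sorted(perm_values)
    # Assumed threshold: midpoint between the last non-tail and first tail value.
    threshold = (ordered[-n_exceed - 1] + ordered[-n_exceed]) / 2.0
    if observed <= threshold:
        return empirical_p(observed, perm_values)  # tail model not needed here
    excesses = [v - threshold for v in ordered[-n_exceed:]]
    mean = statistics.mean(excesses)
    var = statistics.variance(excesses)
    if var == 0:
        return empirical_p(observed, perm_values)
    # Method-of-moments GPD estimates (an assumption; valid for shape < 1/2).
    ratio = mean * mean / var
    shape = 0.5 * (1.0 - ratio)
    scale = 0.5 * mean * (ratio + 1.0)
    z = observed - threshold
    if abs(shape) < 1e-12:
        survival = math.exp(-z / scale)
    else:
        base = 1.0 + shape * z / scale
        survival = base ** (-1.0 / shape) if base > 0 else 0.0
    # P(exceed threshold) times P(exceed observed | exceeded threshold).
    return (n_exceed / len(ordered)) * survival


if __name__ == "__main__":
    rng = random.Random(0)
    perms = [rng.expovariate(1.0) for _ in range(5000)]  # stand-in null values
    print(empirical_p(8.0, perms), gpd_tail_p(8.0, perms))
```

With N permutations the empirical estimate can only take values that are multiples of 1/N, which is the resolution limit the abstract refers to; the fitted tail can extrapolate below that limit.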
9.  Physiological and Transcriptional Responses of Saccharomyces cerevisiae to Zinc Limitation in Chemostat Cultures †  
Applied and Environmental Microbiology  2007;73(23):7680-7692.
Transcriptional responses of the yeast Saccharomyces cerevisiae to Zn availability were investigated at a fixed specific growth rate under limiting and abundant Zn concentrations in chemostat culture. To investigate the context dependency of this transcriptional response and eliminate growth rate-dependent variations in transcription, yeast was grown under several chemostat regimens, resulting in various carbon (glucose), nitrogen (ammonium), zinc, and oxygen supplies. A robust set of genes that responded consistently to Zn limitation was identified, and the set enabled the definition of the Zn-specific Zap1p regulon, comprised of 26 genes and characterized by a broader zinc-responsive element consensus (MHHAACCBYNMRGGT) than so far described. Most surprising was the Zn-dependent regulation of genes involved in storage carbohydrate metabolism. Their concerted down-regulation was physiologically relevant as revealed by a substantial decrease in glycogen and trehalose cellular content under Zn limitation. An unexpectedly large number of genes were synergistically or antagonistically regulated by oxygen and Zn availability. This combinatorial regulation suggested a more prominent involvement of Zn in mitochondrial biogenesis and function than hitherto identified.
PMCID: PMC2168061  PMID: 17933919
10.  Module-Based Outcome Prediction Using Breast Cancer Compendia 
PLoS ONE  2007;2(10):e1047.
The availability of large collections of microarray datasets (compendia), or knowledge about grouping of genes into pathways (gene sets), is typically not exploited when training predictors of disease outcome. These can be useful since a compendium increases the number of samples, while gene sets reduce the size of the feature space. This should be favorable from a machine learning perspective and result in more robust predictors.
We extracted modules of regulated genes from gene sets, and compendia. Through supervised analysis, we constructed predictors which employ modules predictive of breast cancer outcome. To validate these predictors we applied them to independent data, from the same institution (intra-dataset), and other institutions (inter-dataset).
We show that modules derived from single breast cancer datasets achieve better performance on the validation data compared to gene-based predictors. We also show that there is a trend in compendium specificity and predictive performance: modules derived from a single breast cancer dataset, and a breast cancer specific compendium perform better compared to those derived from a human cancer compendium. Additionally, the module-based predictor provides a much richer insight into the underlying biology. Frequently selected gene sets are associated with processes such as cell cycle, E2F regulation, DNA damage response, proteasome and glycolysis. We analyzed two modules related to cell cycle, and the OCT1 transcription factor, respectively. On an individual basis, these modules provide a significant separation in survival subgroups on the training and independent validation data.
PMCID: PMC2002511  PMID: 17940611
11.  New insights on human T cell development by quantitative T cell receptor gene rearrangement studies and gene expression profiling 
The Journal of Experimental Medicine  2005;201(11):1715-1723.
To gain more insight into initiation and regulation of T cell receptor (TCR) gene rearrangement during human T cell development, we analyzed TCR gene rearrangements by quantitative PCR analysis in nine consecutive T cell developmental stages, including CD34+ lin− cord blood cells as a reference. The same stages were used for gene expression profiling using DNA microarrays. We show that TCR loci rearrange in a highly ordered way (TCRD-TCRG-TCRB-TCRA) and that the initiating Dδ2-Dδ3 rearrangement occurs at the most immature CD34+CD38−CD1a− stage. TCRB rearrangement starts at the CD34+CD38+CD1a− stage and complete in-frame TCRB rearrangements were first detected in the immature single positive stage. TCRB rearrangement data together with the PTCRA (pTα) expression pattern show that human TCRβ-selection occurs at the CD34+CD38+CD1a+ stage. By combining the TCR rearrangement data with gene expression data, we identified candidate factors for the initiation/regulation of TCR recombination. Our data demonstrate that a number of key events occur earlier than assumed previously; therefore, human T cell development is much more similar to murine T cell development than reported before.
PMCID: PMC2213269  PMID: 15928199
