In recent years, imaging-based, automated, non-invasive, and non-destructive high-throughput plant phenotyping platforms have become popular tools for plant biology, underpinning the field of plant phenomics. Such platforms acquire and record large amounts of raw data that must be accurately and robustly calibrated, reconstructed, and analysed, requiring the development of sophisticated image understanding and quantification algorithms. The raw data can be processed in different ways, and the past few years have seen the emergence of two main approaches: 2D image processing and 3D mesh processing algorithms. Direct image quantification methods (usually 2D) dominate the current literature due to their comparative simplicity. However, 3D mesh analysis has tremendous potential to accurately estimate specific morphological features cross-sectionally and to monitor them over time.
In this paper, we present a novel 3D mesh-based technique developed for temporal high-throughput plant phenomics and perform initial tests for the analysis of Gossypium hirsutum vegetative growth. Based on plant meshes previously reconstructed from multi-view images, the methodology involves several stages, including morphological mesh segmentation, phenotypic parameter estimation, and tracking of plant organs over time. The initial study focuses on presenting and validating the accuracy of the methodology on dicotyledons such as cotton, but we believe the approach will be more broadly applicable. The study applied our technique to a set of six Gossypium hirsutum (cotton) plants observed over four time-points. Manual measurements, performed for each plant at every time-point, were used to assess the accuracy of our pipeline and to quantify the error in the estimated morphological parameters.
By directly comparing our automated mesh-based quantitative data with manual measurements of individual stem height, leaf width, and leaf length, we obtained mean absolute errors of 9.34%, 5.75%, and 8.78%, and correlation coefficients of 0.88, 0.96, and 0.95, respectively. The temporal matching of leaves was accurate in 95% of cases, and the average execution time required to analyse a plant over four time-points was 4.9 minutes. The mesh-processing-based methodology is thus considered suitable for quantitative 4D monitoring of plant phenotypic features.
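The validation metrics reported above (mean absolute percentage error against manual measurements, plus a correlation coefficient) are straightforward to compute. A minimal NumPy sketch with hypothetical toy measurements, not the authors' pipeline:

```python
import numpy as np

def validate(auto, manual):
    """Mean absolute percentage error (MAPE) and Pearson correlation
    between automated estimates and manual reference measurements."""
    auto = np.asarray(auto, dtype=float)
    manual = np.asarray(manual, dtype=float)
    mape = np.mean(np.abs(auto - manual) / manual) * 100.0
    r = np.corrcoef(auto, manual)[0, 1]
    return mape, r

# hypothetical stem heights (cm): manual reference vs. automated estimate
manual = np.array([12.0, 15.5, 18.2, 22.0, 25.1, 30.4])
auto = manual * (1 + np.array([0.05, -0.08, 0.04, -0.03, 0.07, -0.06]))
mape, r = validate(auto, manual)
```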
Microarray technology provides an efficient means for globally exploring physiological processes governed by the coordinated expression of multiple genes. However, identifying genes differentially expressed in microarray experiments is challenging because of the potentially high type I error rate. Methods for large-scale statistical analyses have been developed, but most of them are applicable only to two-sample or two-condition data.
We developed a large-scale multiple-group F-test based method, named ranking analysis of F-statistics (RAF), which is an extension of ranking analysis of microarray data (RAM) for the two-sample t-test. In this method, we propose a novel random-splitting approach to generate the null distribution instead of using permutation, which may not be appropriate for microarray data. We also implemented a two-simulation strategy to estimate the false discovery rate. Simulation results suggest that it has higher efficiency in finding differentially expressed genes among multiple classes at a lower false discovery rate than some commonly used methods. By applying our method to the experimental data, we found 107 genes with significantly differential expression among 4 treatments at <0.7% FDR, of which 31 belong to expressed sequence tags (ESTs) and 76 are unique genes with known functions in the brain or central nervous system, falling into six major functional groups.
Our method is suitable for identifying differentially expressed genes among multiple groups, in particular when the sample size is small.
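The multiple-group comparison underlying RAF rests on a per-gene one-way ANOVA F-statistic by which genes are then ranked. A hedged NumPy sketch of that core step (not the authors' implementation; the random-splitting null and two-simulation FDR estimation are omitted):

```python
import numpy as np

def gene_f_statistics(expr, groups):
    """One-way ANOVA F-statistic for every gene (row) across sample groups."""
    groups = np.asarray(groups)
    labels = np.unique(groups)
    k, n = len(labels), expr.shape[1]
    grand_mean = expr.mean(axis=1, keepdims=True)
    ss_between = np.zeros(expr.shape[0])
    ss_within = np.zeros(expr.shape[0])
    for g in labels:
        sub = expr[:, groups == g]
        m = sub.mean(axis=1, keepdims=True)
        ss_between += sub.shape[1] * ((m - grand_mean).ravel() ** 2)
        ss_within += ((sub - m) ** 2).sum(axis=1)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

rng = np.random.default_rng(0)
expr = rng.normal(size=(100, 12))             # 100 genes, 12 samples
expr[0, 8:] += 6.0                            # gene 0 shifted in the third group
groups = np.repeat([0, 1, 2], 4)
F = gene_f_statistics(expr, groups)
ranked = np.argsort(F)[::-1]                  # RAF-style ranking: largest F first
```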
About one-fifth of the genes in the budding yeast are essential for haploid viability and cannot be functionally assessed using standard genetic approaches such as gene deletion. To facilitate genetic analysis of essential genes, we and others have assembled collections of yeast strains expressing temperature-sensitive (ts) alleles of essential genes. To explore the phenotypes caused by mutation of essential genes, we used a panel of genetically engineered fluorescent markers to examine the morphology of cells in the ts strain collection using high-throughput microscopy. Here, we describe the design and implementation of an online database, PhenoM (Phenomics of yeast Mutants), for storing, retrieving, visualizing and data mining the quantitative single-cell measurements extracted from micrographs of the ts mutant cells. PhenoM allows users to rapidly search and retrieve raw images and their quantified morphological data for genes of interest. The database also provides several data-mining tools, including a PhenoBlast module for phenotypic comparison between mutant strains and a Gene Ontology module for functional enrichment analysis of gene sets showing similar morphological alterations. The current PhenoM version 1.0 contains 78 194 morphological images and 1 909 914 cells covering six subcellular compartments or structures for 775 ts alleles spanning 491 essential genes. PhenoM is freely available at http://phenom.ccbr.utoronto.ca/.
Phenomics is an emerging transdiscipline dedicated to the systematic study of phenotypes on a genome-wide scale. New methods for high-throughput genotyping have shifted the priority of biomedical research to phenotyping, but the human phenome is vast and its dimensionality remains unknown. Phenomics research strategies capable of linking genetic variation to public health concerns need to prioritize development of mechanistic frameworks that relate neural systems functioning to human behavior. New approaches to phenotype definition will benefit from crossing neuropsychiatric syndromal boundaries, and defining phenotypic features across multiple levels of expression from proteome to syndrome. The demand for high-throughput phenotyping may stimulate a migration from conventional laboratory to web-based assessment of behavior, and this offers the promise of dynamic phenotyping: the iterative refinement of phenotype assays based on prior genotype-phenotype associations. Phenotypes that can be studied across species may provide the greatest traction, particularly given rapid development in transgenic modeling. Phenomics research demands vertically integrated research teams, novel analytic strategies and informatics infrastructure to help manage complexity. The Consortium for Neuropsychiatric Phenomics at UCLA has been supported by the NIH Roadmap Initiative to illustrate these principles, and is developing applications that may help investigators assemble, visualize, and ultimately test multi-level phenomics hypotheses. As the transdiscipline of phenomics matures, and work is extended to large-scale international collaborations, there is promise that systematic new knowledgebases will help fulfill the promise of personalized medicine and the rational diagnosis and treatment of neuropsychiatric syndromes.
phenotype; genetics; genomics; informatics; cognition; psychiatry
The broad aim of biomedical science in the postgenomic era is to link genomic and phenotype information to allow deeper understanding of the processes leading from genomic changes to altered phenotype and disease. The EuroPhenome project (http://www.EuroPhenome.org) is a comprehensive resource for raw and annotated high-throughput phenotyping data arising from projects such as EUMODIC. EUMODIC is gathering data from the EMPReSSslim pipeline (http://www.empress.har.mrc.ac.uk/) which is performed on inbred mouse strains and knock-out lines arising from the EUCOMM project. The EuroPhenome interface allows the user to access the data via the phenotype or genotype. It also allows the user to access the data in a variety of ways, including graphical display, statistical analysis and access to the raw data via web services. The raw phenotyping data captured in EuroPhenome is annotated by an annotation pipeline which automatically identifies statistically different mutants from the appropriate baseline and assigns ontology terms for that specific test. Mutant phenotypes can be quickly identified using two EuroPhenome tools: PhenoMap, a graphical representation of statistically relevant phenotypes, and mining for a mutant using ontology terms. To assist with data definition and cross-database comparisons, phenotype data is annotated using combinations of terms from biological ontologies.
We applied a new approach based on Mantel statistics to analyze the Genetic Analysis Workshop 14 simulated data with prior knowledge of the answers. The method was developed to improve the power of haplotype sharing analysis for gene mapping in complex disease. The new statistic correlates genetic similarity and phenotypic similarity across pairs of haplotypes from case-control studies. Genetic similarity is measured as the shared length between haplotype pairs around a genetic marker; phenotypic similarity is measured as the mean-corrected cross-product based on the respective phenotypes. Cases with phenotype P1 and unrelated controls were drawn from the population of Danacaa. Power to detect main effects was compared to the χ2-test for association based on 3-marker haplotypes and to a global permutation test for haplotype association. Power to detect gene × gene interaction was compared to unconditional logistic regression. The results suggest that the Mantel statistics might be more powerful than the alternative tests.
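The core of a Mantel-type analysis is the correlation between the entries of a genetic-similarity matrix and a phenotypic-similarity matrix, calibrated by jointly permuting rows and columns of one matrix. A generic NumPy sketch (toy similarity matrices, not the GAW14 haplotype-sharing measures):

```python
import numpy as np

def mantel(A, B, n_perm, rng):
    """Mantel test: correlate the off-diagonal entries of two similarity
    matrices; the null permutes rows and columns of B together."""
    n = A.shape[0]
    iu = np.triu_indices(n, k=1)
    obs = np.corrcoef(A[iu], B[iu])[0, 1]
    count = 0
    for _ in range(n_perm):
        p = rng.permutation(n)
        Bp = B[np.ix_(p, p)]                 # permute rows and columns together
        count += np.corrcoef(A[iu], Bp[iu])[0, 1] >= obs
    return obs, (count + 1) / (n_perm + 1)

rng = np.random.default_rng(2)
x = rng.normal(size=30)
A = -np.abs(x[:, None] - x[None, :])         # toy "genetic" similarity
B = A + rng.normal(scale=0.5, size=(30, 30))
B = (B + B.T) / 2                            # noisy "phenotypic" similarity
r, p = mantel(A, B, 199, rng)
```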
Microarray technology provides a powerful tool for profiling the expression of thousands of genes simultaneously, making it possible to explore the molecular and metabolic etiology of a complex disease under study. However, classical statistical methods cannot be applied directly to microarray data, and it is therefore necessary to develop powerful methods for large-scale statistical analyses. In this paper, we describe a novel method, called Ranking Analysis of Microarray data (RAM). RAM is a large-scale two-sample t-test method based on comparisons between a set of ranked T-statistics and a set of ranked Z-values (ranked estimated null scores) yielded by a "randomly splitting" approach instead of a "permutation" approach, together with a two-simulation strategy for estimating the proportion of genes identified by chance, i.e., the false discovery rate (FDR). The results obtained from simulated and observed microarray data show that RAM is more efficient than Significance Analysis of Microarrays (SAM) in identifying differentially expressed genes and estimating the FDR under undesirable conditions such as a large fudge factor, small sample size, or a mixture distribution of noises.
Microarray; t-test; ranking analysis; false discovery rate
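The "randomly splitting" idea in RAM can be illustrated generically: null scores are generated by splitting each group against itself, so that no true group difference can contribute to them. A hedged NumPy sketch (the statistic and splitting scheme here are simplified stand-ins, not the published algorithm):

```python
import numpy as np

def t_stats(a, b, s0=0.0):
    """Per-gene two-sample t-like statistic with an optional fudge factor s0."""
    na, nb = a.shape[1], b.shape[1]
    se = np.sqrt(a.var(axis=1, ddof=1) / na + b.var(axis=1, ddof=1) / nb)
    return (a.mean(axis=1) - b.mean(axis=1)) / (se + s0)

def random_split_null(a, b, n_splits, rng):
    """Null scores from randomly splitting each group against itself, so no
    true group difference can contribute (a stand-in for RAM's idea)."""
    null = []
    for _ in range(n_splits):
        for grp in (a, b):
            idx = rng.permutation(grp.shape[1])
            half = grp.shape[1] // 2
            null.append(t_stats(grp[:, idx[:half]], grp[:, idx[half:]]))
    return np.sort(np.concatenate(null))

rng = np.random.default_rng(1)
a = rng.normal(size=(200, 8))                 # 200 genes, 8 samples per group
b = rng.normal(size=(200, 8))
b[:5] += 2.5                                  # five truly shifted genes
obs = np.sort(np.abs(t_stats(a, b)))[::-1]    # ranked observed statistics
null = random_split_null(a, b, 20, rng)       # ranked null scores
```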
The recent completion of the Human Genome Project has made possible a high-throughput “systems approach” for accelerating the elucidation of molecular underpinnings of human diseases, and subsequent derivation of molecular-based strategies to more effectively prevent, diagnose, and treat these diseases. Although altered phenotypes are among the most reliable manifestations of altered gene functions, research using systematic analysis of phenotype relationships to study human biology is still in its infancy. This article focuses on the emerging field of high-throughput phenotyping (HTP) phenomics research, which aims to capitalize on novel high-throughput computation and informatics technology developments to derive genomewide molecular networks of genotype–phenotype associations, or “phenomic associations.” The HTP phenomics research field faces the challenge of technological research and development to generate novel tools in computation and informatics that will allow researchers to amass, access, integrate, organize, and manage phenotypic databases across species and enable genomewide analysis to associate phenotypic information with genomic data at different scales of biology. Key state-of-the-art technological advancements critical for HTP phenomics research are covered in this review. In particular, we highlight the power of computational approaches to conduct large-scale phenomics studies.
computational genomics; gene–disease associations; phenomics; phenotype
The aim of this paper is to generalize permutation methods for multiple testing adjustment of significant partial regression coefficients in a linear regression model used for microarray data. Using a permutation method outlined by Anderson and Legendre and the permutation P-value adjustment from Simon et al., the significance of disease-related gene expression will be determined and adjusted after accounting for the effects of covariates, which are not restricted to be categorical. We apply these methods to a microarray dataset containing confounders and illustrate the comparisons between the permutation-based adjustments and the normal theory adjustments. The application of a linear model is emphasized for data containing confounders, and the permutation-based approaches are shown to be better suited for microarray data.
multiple comparisons; gene expression; permutation; linear regression; adjusted P-values
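A standard way to obtain a permutation P-value for a regression coefficient while accounting for covariates is the Freedman-Lane scheme, which permutes residuals of the reduced model. A minimal sketch for a single gene (illustrative; the paper's exact scheme and its multiple-testing adjustment across genes are not reproduced here):

```python
import numpy as np

def freedman_lane(y, x, Z, n_perm, rng):
    """Permutation P-value for the coefficient of x in y ~ x + Z, permuting
    residuals of the reduced model y ~ Z (Freedman-Lane scheme)."""
    Zc = np.column_stack([np.ones(len(y)), Z])
    beta_r, *_ = np.linalg.lstsq(Zc, y, rcond=None)   # reduced model y ~ Z
    fit_r = Zc @ beta_r
    res_r = y - fit_r
    X = np.column_stack([Zc, x])

    def coef(yy):
        b, *_ = np.linalg.lstsq(X, yy, rcond=None)
        return b[-1]

    obs = abs(coef(y))
    count = sum(abs(coef(fit_r + rng.permutation(res_r))) >= obs
                for _ in range(n_perm))
    return (count + 1) / (n_perm + 1)

rng = np.random.default_rng(3)
n = 60
Z = rng.normal(size=(n, 2))                  # covariates, e.g. age and batch
x = rng.normal(size=n)                       # expression of one gene
y = 0.8 * x + Z @ np.array([1.0, -0.5]) + rng.normal(size=n)
p = freedman_lane(y, x, Z, 499, rng)
```

Across many genes, a Westfall-Young-style maxT step over the same permutations would give the adjusted P-values the abstract refers to.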
In microarray data analysis, the comparison of gene-expression profiles with respect to different conditions and the selection of biologically interesting genes are crucial tasks. Multivariate statistical methods have been applied to analyze these large datasets. Less work has been published concerning the assessment of the reliability of gene-selection procedures. Here we describe a method to assess reliability in multivariate microarray data analysis using permutation-validated principal components analysis (PCA). The approach is designed for microarray data with a group structure.
We used PCA to detect the major sources of variance underlying the hybridization conditions followed by gene selection based on PCA-derived and permutation-based test statistics. We validated our method by applying it to well characterized yeast cell-cycle data and to two datasets from our laboratory. We could describe the major sources of variance, select informative genes and visualize the relationship of genes and arrays. We observed differences in the level of the explained variance and the interpretability of the selected genes.
Combining data visualization and permutation-based gene selection, permutation-validated PCA enables one to illustrate gene-expression variance between several conditions and to select genes by taking into account the relationship of between-group to within-group variance of genes. The method can be used to extract the leading sources of variance from microarray data, to visualize relationships between genes and hybridizations and to select informative genes in a statistically reliable manner. This selection accounts for the level of reproducibility of replicates or group structure as well as gene-specific scatter. Visualization of the data can support a straightforward biological interpretation.
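The permutation-validation idea can be illustrated generically: compare the variance explained by the first principal component with a null distribution obtained by permuting each gene's values across arrays, which destroys shared structure while preserving gene-wise scatter. A toy NumPy sketch (not the authors' procedure):

```python
import numpy as np

def pca_explained(X):
    """Fraction of total variance explained by each principal component
    (genes in rows, arrays in columns; genes are centred)."""
    Xc = X - X.mean(axis=1, keepdims=True)
    s = np.linalg.svd(Xc, compute_uv=False)
    return s ** 2 / np.sum(s ** 2)

rng = np.random.default_rng(4)
n_genes, n_arrays = 300, 10
X = rng.normal(size=(n_genes, n_arrays))
X[:50, 5:] += 3.0                            # group structure shared by 50 genes
obs = pca_explained(X)[0]

# null: permute each gene's values across arrays independently, destroying
# the shared group structure while keeping gene-wise scatter
null = [pca_explained(np.apply_along_axis(rng.permutation, 1, X))[0]
        for _ in range(100)]
p = (1 + sum(v >= obs for v in null)) / (100 + 1)
```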
In this paper, we develop an efficient moments-based permutation test approach to improve the test’s computational efficiency by approximating the permutation distribution of the test statistic with Pearson distribution series. This approach involves the calculation of the first four moments of the permutation distribution. We propose a novel recursive method to derive these moments theoretically and analytically without any permutation. Experimental results using different test statistics are demonstrated using simulated data and real data. The proposed strategy takes advantage of nonparametric permutation tests and parametric Pearson distribution approximation to achieve both accuracy and efficiency.
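For simple linear statistics, low-order moments of the permutation distribution are available in closed form, which is the kind of fact a moments-based approach exploits. As a sanity check, the first two moments of a two-sample sum statistic follow finite-population sampling formulas (the paper derives the first four moments recursively and fits a Pearson curve; this sketch only verifies the first two against brute force):

```python
import numpy as np

# For T = sum of the values assigned to group 1 under random relabelling,
# the permutation distribution's first two moments follow finite-population
# sampling formulas -- no permutations needed.
rng = np.random.default_rng(5)
x = rng.normal(size=20)
n, n1 = len(x), 8

mean_T = n1 * x.mean()
var_T = n1 * (n - n1) / (n - 1) * x.var()     # x.var() is the ddof=0 variance

# brute-force check against actual permutations
perm_T = np.array([x[rng.permutation(n)[:n1]].sum() for _ in range(20000)])
```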
Phenotypes are investigated in model organisms to understand and reveal the molecular mechanisms underlying disease. Phenotype ontologies were developed to capture and compare phenotypes within the context of a single species. Recently, these ontologies were augmented with formal class definitions that may be utilized to integrate phenotypic data and enable the direct comparison of phenotypes between different species. We have developed a method to transform phenotype ontologies into a formal representation, combine phenotype ontologies with anatomy ontologies, and apply a measure of semantic similarity to construct the PhenomeNET cross-species phenotype network. We demonstrate that PhenomeNET can identify orthologous genes, genes involved in the same pathway and gene–disease associations through the comparison of mutant phenotypes. We provide evidence that the Adam19 and Fgf15 genes in mice are involved in the tetralogy of Fallot, and, using zebrafish phenotypes, propose the hypothesis that the mammalian homologs of Cx36.7 and Nkx2.5 lie in a pathway controlling cardiac morphogenesis and electrical conductivity which, when defective, causes the tetralogy of Fallot phenotype. Our method implements a whole-phenome approach toward disease gene discovery and can be applied to prioritize genes for rare and orphan diseases for which the molecular basis is unknown.
Considering cells as biofactories, we aimed to optimize their internal processes by using the same engineering principles that large industries are implementing nowadays: lean manufacturing. We have applied reverse engineering computational methods to transcriptomic, metabolomic and phenomic data obtained from a collection of tomato recombinant inbred lines to formulate a kinetic and constraint-based model that efficiently describes the cellular metabolism from expression of a minimal core of genes. Based on predicted metabolic profiles, a close association with agronomic and organoleptic properties of the ripe fruit was revealed with high statistical confidence. Inspired by a synthetic biology approach, the model was used for exploring the landscape of all possible local transcriptional changes with the aim of engineering tomato fruits with fine-tuned biotechnological properties. The method was validated by the ability of the proposed genomes, engineered for desired agronomic traits, to recapitulate experimental correlations between associated metabolites.
Considering cells as biofactories, we aimed to optimize their internal processes by using existing design principles acquired from engineering. Herein, we present a synthetic biology approach based on experimental and computational methodology that integrates genomic, transcriptomic, metabolomic and phenomic data to formulate a kinetic and constraint-based model of tomato agronomic and fruit quality characteristics. The model has been used for exploring the landscape of all possible local transcriptional changes with the aim of engineering tomato fruits with improved biotechnological properties. The methodology was validated by the ability of the proposed engineered genomes, with modified desired agronomic traits, to recapitulate correlations between associated metabolites that are found experimentally in a number of examples.
The antibody microarray is a powerful chip-based technology for profiling hundreds of proteins simultaneously and is increasingly used. To study the humoral response in pancreatic cancers, Patwa et al. (2007) developed a two-dimensional liquid separation technique and built a two-dimensional antibody microarray. However, identifying differentially expressed regions on the antibody microarray requires appropriate statistical methods to fairly assess the large amounts of data generated. In this paper, we propose a permutation-based test using the spatial information of the two-dimensional antibody microarray. By borrowing strength from neighboring differentially expressed spots, we are able to detect the differentially expressed region with very high power while controlling the type I error at 0.05 in our simulation studies. We also apply the proposed methodology to a real microarray dataset.
Antibody Microarray; Permutation; Spatial information
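The borrowing-strength idea can be sketched generically: smooth a per-spot statistic over a spatial neighborhood, then permute sample labels to calibrate the maximum smoothed statistic. A toy NumPy illustration (a 3x3 box average stands in for whatever spatial weighting the paper uses):

```python
import numpy as np

def smooth_t(case, ctrl):
    """Per-spot t-like statistic averaged over a 3x3 spatial neighborhood,
    borrowing strength from neighboring spots (illustrative sketch)."""
    diff = case.mean(axis=0) - ctrl.mean(axis=0)
    se = np.sqrt(case.var(axis=0, ddof=1) / len(case) +
                 ctrl.var(axis=0, ddof=1) / len(ctrl))
    t = diff / se
    tp = np.pad(t, 1, mode='edge')       # pad edges before box averaging
    n0, n1 = t.shape
    return sum(tp[i:i + n0, j:j + n1] for i in range(3) for j in range(3)) / 9.0

rng = np.random.default_rng(6)
n_case, n_ctrl, grid = 10, 10, (20, 20)
data = rng.normal(size=(n_case + n_ctrl, *grid))
data[:n_case, 5:9, 5:9] += 1.5           # a 4x4 differentially expressed region
obs = smooth_t(data[:n_case], data[n_case:])

# permuting sample labels calibrates the maximum smoothed statistic
null_max = []
for _ in range(99):
    d = data[rng.permutation(n_case + n_ctrl)]
    null_max.append(smooth_t(d[:n_case], d[n_case:]).max())
pval = (1 + sum(m >= obs.max() for m in null_max)) / (99 + 1)
```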
The use of high-throughput sequence data in genetic epidemiology allows the investigation of common and rare variants in the entire genome, thus increasing the amount of information and the potential number of statistical tests performed within one study. As a consequence, the problem of multiple testing may become even more pressing than in previous studies. As an important challenge, the exact number of statistical tests depends on the actual statistical method used. Furthermore, many statistical approaches for the analysis of sequence data themselves require permutation, so it may be difficult to additionally use permutation to estimate correct type I error levels, as was done in genome-wide association studies. In view of this, a separate group at Genetic Analysis Workshop 17 was formed with a focus on multiple testing. Here, we present the approaches used for the workshop. Apart from tackling the multiple testing problem, the new group focused on different issues. Some contributors developed and investigated modifications of existing collapsing methods. Others aimed at improving the identification of functional variants through a reduction and analysis of the underlying data dimensions. Two research groups investigated the overall accumulation of rare variation across the genome and its value in predicting phenotypes. Finally, other investigators left the path of traditional statistical analyses by reversing null and alternative hypotheses and by proposing a novel resampling method. We describe and discuss all these approaches.
next-generation sequencing; resampling; collapsing methods; rare sequence variants
In this paper we consider how to combine neuronal signals from multiple electrodes to optimally predict behavioral choices from observed neural activity. The predictability is often quantified by the area under the receiver operating characteristic (ROC) curve, also called choice probability in neurophysiology. We propose a distribution-free, relaxation-based multichannel signal combination (RELAX-MUSIC) approach that requires only simple pairwise combination and recursive implementation for optimizing the area under the ROC curve. A permutation test is employed to assess the statistical significance of the derived choice probability. We demonstrate that the RELAX-MUSIC approach outperforms the commonly used response pooling and Fisher linear discriminant (FLD) methods. The excellent performance of the RELAX-MUSIC approach for predicting perceptual decisions from neural activity is demonstrated via examples using simulated and experimental data.
multichannel combination; receiver operating characteristic (ROC); Fisher linear discriminant (FLD); structure from motion (SFM); single unit activity (SUA); visual cortex; middle temporal (MT)
In computational biology, permutation tests have become a widely used tool to assess the statistical significance of an event under investigation. However, the common way of computing the P-value, which expresses the statistical significance, requires a very large number of permutations when small (and thus interesting) P-values are to be accurately estimated. This is computationally expensive and often infeasible. Recently, we proposed an alternative estimator, which requires far fewer permutations than the standard empirical approach while still reliably estimating small P-values.
The proposed P-value estimator has been enriched with additional functionalities and is made available to the general community through a public website and web service, called EPEPT. This means that the EPEPT routines can be accessed not only via a website, but also programmatically using any programming language that can interact with the web. Examples of web service clients in multiple programming languages can be downloaded. Additionally, EPEPT accepts data of various common experiment types used in computational biology. For these experiment types EPEPT first computes the permutation values and then performs the P-value estimation. Finally, the source code of EPEPT can be downloaded.
Different types of users, such as biologists, bioinformaticians and software engineers, can use the method in an appropriate and simple way.
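The estimator behind EPEPT is, to our understanding, a tail approximation: fit a generalized Pareto distribution (GPD) to the exceedances of the permutation values over a high threshold, then extrapolate beyond the largest permutation value. A hedged SciPy sketch of that idea (threshold selection and the goodness-of-fit checks the method requires are omitted):

```python
import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(9)
perm = rng.normal(size=1000)            # a modest set of permutation values
obs = 5.0                               # observed statistic beyond every permutation

# the empirical estimator saturates: no permutation value reaches obs
p_emp = (1 + (perm >= obs).sum()) / (len(perm) + 1)

# tail approximation: fit a GPD to the exceedances over a high threshold
# and extrapolate into the far tail
thresh = np.quantile(perm, 0.9)
exc = perm[perm > thresh] - thresh
c, loc, scale = genpareto.fit(exc, floc=0)
p_tail = (perm > thresh).mean() * genpareto.sf(obs - thresh, c, loc=loc, scale=scale)
```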
To extend agricultural productivity by knowledge-based breeding and tailor varieties adapted to specific environmental conditions, it is imperative to improve our ability to assess the dynamic changes of the phenome of crops under field conditions. To this end, we have developed a precision phenotyping platform that combines various sensors for a non-invasive, high-throughput and high-dimensional phenotyping of small grain cereals. This platform yielded high prediction accuracies and heritabilities for biomass of triticale. Genetic variation for biomass accumulation was dissected with 647 doubled haploid lines derived from four families. Employing a genome-wide association mapping approach, two major quantitative trait loci (QTL) for biomass were identified and the genetic architecture of biomass accumulation was found to be characterized by dynamic temporal patterns. Our findings highlight the potential of precision phenotyping to assess the dynamic genetics of complex traits, especially those not amenable to traditional phenotyping.
Modern high-throughput measurement technologies such as DNA microarrays and next generation sequencers produce extensive datasets. With large datasets the emphasis has been moving from traditional statistical tests to new data mining methods that are capable of detecting complex patterns, such as clusters, regulatory networks, or time series periodicity. The study of periodic gene expression is an interesting research question that is also a good example of the challenges involved in the analysis of high-throughput data in general. Unlike for classical statistical tests, the distribution of the test statistic for data mining methods cannot be derived analytically.
We describe the randomization-based approach to significance testing and show how it can be applied to detect periodically expressed genes. We present four randomization methods, three of which have previously been used for gene cycle data. We propose a new method for testing the significance of periodicity in short time series of gene expression data, such as those from gene cycle and circadian clock studies. We argue that the underlying assumptions behind existing significance testing approaches are problematic and some of them unrealistic. We analyze the theoretical properties of the existing and proposed methods, showing how our method can be used robustly to detect genes with exceptionally high periodicity. We also demonstrate the large differences in the number of significant results depending on the chosen randomization method and the parameters of the testing framework.
By reanalyzing gene cycle data from various sources, we show that previous estimates of the number of gene-cycle-controlled genes are not supported by the data. Our randomization approach, combined with the widely adopted Benjamini-Hochberg multiple testing method, yields better predictive power and produces more accurate null distributions than previous methods.
Existing methods for testing the significance of periodic gene expression patterns are simplistic and optimistic. Our testing framework allows strict levels of statistical significance with more realistic underlying assumptions, without losing predictive power. As DNA microarrays have now become mainstream and new high-throughput methods are rapidly being adopted, we argue that there will be a need not only for data mining methods capable of coping with immense datasets, but also for solid methods for significance testing.
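A minimal version of the workflow described above: a per-gene periodicity score, a permutation null obtained by shuffling time points, and Benjamini-Hochberg adjustment of the resulting P-values. This is a generic sketch with synthetic data, not the authors' randomization methods:

```python
import numpy as np

def periodicity_score(ts, period):
    """Spectral power of a series at a given period (Fisher-style score)."""
    t = np.arange(len(ts))
    c = np.cos(2 * np.pi * t / period) @ ts
    s = np.sin(2 * np.pi * t / period) @ ts
    return (c ** 2 + s ** 2) / len(ts)

def perm_pvalues(X, period, n_perm, rng):
    """Per-gene permutation P-value: shuffling time points breaks periodicity."""
    obs = np.array([periodicity_score(g, period) for g in X])
    exceed = np.zeros(len(X))
    for _ in range(n_perm):
        Xp = np.apply_along_axis(rng.permutation, 1, X)
        exceed += np.array([periodicity_score(g, period) for g in Xp]) >= obs
    return (exceed + 1) / (n_perm + 1)

def benjamini_hochberg(p):
    """BH step-up adjusted P-values."""
    m = len(p)
    order = np.argsort(p)
    adj = np.empty(m)
    prev = 1.0
    for rank, i in zip(range(m, 0, -1), order[::-1]):
        prev = min(prev, p[i] * m / rank)
        adj[i] = prev
    return adj

rng = np.random.default_rng(7)
n_genes, n_times = 50, 24
X = rng.normal(scale=0.5, size=(n_genes, n_times))
X[:5] += 2.0 * np.sin(2 * np.pi * np.arange(n_times) / 8)  # 5 periodic genes
p = perm_pvalues(X, period=8, n_perm=999, rng=rng)
q = benjamini_hochberg(p)
```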
With the development of high-throughput sequencing and genotyping technologies, the number of markers collected in genetic association studies is growing rapidly, increasing the importance of methods for correcting for multiple hypothesis testing. The permutation test is widely considered the gold standard for accurate multiple testing correction, but it is often computationally impractical for these large datasets. Recently, several studies proposed efficient alternative approaches to the permutation test based on the multivariate normal distribution (MVN). However, they cannot accurately correct for multiple testing in genome-wide association studies for two reasons. First, these methods require partitioning of the genome into many disjoint blocks and ignore all correlations between markers from different blocks. Second, the true null distribution of the test statistic often fails to follow the asymptotic distribution at the tails of the distribution. We propose an accurate and efficient method for multiple testing correction in genome-wide association studies—SLIDE. Our method accounts for all correlation within a sliding window and corrects for the departure of the true null distribution of the statistic from the asymptotic distribution. In simulations using the Wellcome Trust Case Control Consortium data, the error rate of SLIDE's corrected p-values is more than 20 times smaller than the error rate of the previous MVN-based methods' corrected p-values, while SLIDE is orders of magnitude faster than the permutation test and other competing methods. We also extend the MVN framework to the problem of estimating the statistical power of an association study with correlated markers and propose an efficient and accurate power estimation method SLIP. SLIP and SLIDE are available at http://slide.cs.ucla.edu.
In genome-wide association studies, it is important to account for the fact that a large number of genetic variants are tested in order to adequately control for false positives. The simplest way to correct for multiple hypothesis testing is the Bonferroni correction, which multiplies the p-values by the number of markers assuming the markers are independent. Since the markers are correlated due to linkage disequilibrium, this approach leads to a conservative estimate of false positives, thus adversely affecting statistical power. The permutation test is considered the gold standard for accurate multiple testing correction, but is often computationally impractical for large association studies. We propose a method that efficiently and accurately corrects for multiple hypotheses in genome-wide association studies by fully accounting for the local correlation structure between markers. Our method also corrects for the departure of the true distribution of test statistics from the asymptotic distribution, which dramatically improves the accuracy, particularly when many rare variants are included in the tests. Our method shows a near identical accuracy to permutation and shows greater computational efficiency than previously suggested methods. We also provide a method to accurately and efficiently estimate the statistical power of genome-wide association studies.
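The contrast between Bonferroni and permutation-based correction is easy to demonstrate on correlated markers: the permutation null of the maximum statistic accounts for local correlation automatically, whereas Bonferroni treats the markers as independent. A toy NumPy sketch (synthetic genotypes with a crude LD structure; not the SLIDE algorithm, which replaces permutation with a sliding-window multivariate normal approximation):

```python
import numpy as np

rng = np.random.default_rng(8)
n_ind, n_snp = 200, 100
# toy LD structure: adjacent markers share a latent component
base = rng.normal(size=(n_ind, n_snp))
geno = base + 0.9 * np.roll(base, 1, axis=1)
pheno = rng.normal(size=n_ind)               # null phenotype

def max_abs_corr(y, G):
    """Maximum absolute marker-phenotype correlation (the 'minP' statistic)."""
    yc = (y - y.mean()) / y.std()
    Gc = (G - G.mean(axis=0)) / G.std(axis=0)
    return np.abs(yc @ Gc / len(y)).max()

obs = max_abs_corr(pheno, geno)
# the permutation null of the maximum accounts for correlation automatically;
# Bonferroni would instead multiply a per-marker P-value by n_snp
null = np.array([max_abs_corr(rng.permutation(pheno), geno)
                 for _ in range(499)])
p_perm = (1 + (null >= obs).sum()) / (499 + 1)
```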
Phenotypes are an important subject of biomedical research for which many repositories have already been created. Most of these databases are dedicated either to a single species or to a single disease of interest. With the advent of technologies to generate phenotypes in a high-throughput manner, not only is the volume of phenotype data growing fast but so is the need to organize these data in more useful ways. We have created PhenomicDB (freely available online), a multi-species genotype/phenotype database, which shows phenotypes associated with their corresponding genes and grouped by gene orthologies across a variety of species. We have enhanced PhenomicDB recently by additionally incorporating quantitative and descriptive RNA interference (RNAi) screening data, by enabling the usage of phenotype ontology terms and by providing information on assays and cell lines. We envision that the integration of classical phenotypes with high-throughput data will bring new momentum and insights to our understanding. Modern analysis tools under development may help exploit this wealth of information to transform it into knowledge and, eventually, into novel therapeutic approaches.
In the absence of randomization, an experimental treatment may be compared with the standard treatment using a matched design. When only a limited set of cases receives the experimental treatment, matching each case to a non-fixed number of controls is convenient. In order to deal with the highly stratified survival data generated by multiple matching, we extend the multivariate permutation testing approach, since standard nonparametric methods for the comparison of survival curves cannot be applied in this setting. We demonstrate the validity of the proposed method with simulations, and we illustrate its application to data from an observational study for the comparison of bone marrow transplantation and chemotherapy in the treatment of paediatric patients. The use of the multivariate permutation testing approach is recommended in the highly stratified context of matched survival data, especially when the proportional hazards assumption does not hold.
Keywords: Highly stratified data; Matched survival data; Multiple matching; Multivariate permutation tests
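The core idea of a stratified permutation test for matched data can be sketched in a few lines. This is a minimal illustration under strong simplifying assumptions, not the authors' method: all data are simulated, censoring is ignored, and raw survival times are compared directly. The key feature it shares with the approach above is that treatment labels are permuted only within each matched set, preserving the stratified structure.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical matched data: each stratum holds one treated case matched to
# a variable (non-fixed) number of controls, i.e. multiple matching.
strata = []
for s in range(30):
    n_controls = int(rng.integers(1, 5))
    treated_time = rng.exponential(12.0)             # treated case
    control_times = rng.exponential(10.0, size=n_controls)
    strata.append((treated_time, control_times))

def statistic(strata):
    """Sum over strata of (treated time minus mean control time)."""
    return sum(t - c.mean() for t, c in strata)

obs = statistic(strata)

# Permutation: within each stratum, reassign the 'treated' label uniformly
# among the stratum's members; strata are never mixed with one another.
n_perm = 2000
null = np.empty(n_perm)
for b in range(n_perm):
    perm = []
    for t, c in strata:
        times = np.append(c, t)
        idx = int(rng.integers(len(times)))
        perm.append((times[idx], np.delete(times, idx)))
    null[b] = statistic(perm)

# Two-sided permutation p-value.
p_value = (1 + np.sum(np.abs(null) >= np.abs(obs))) / (n_perm + 1)
```

Because the reference distribution is built by permutation rather than from a model, no proportional hazards assumption is required, which is the setting the abstract identifies as problematic for standard methods.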
With the advent of "omics" (e.g. genomics, transcriptomics, proteomics and phenomics), studies can produce enormous amounts of data. Managing these diverse data and integrating them with other biological data are major challenges for the bioinformatics community. Comprehensive new tools are needed to store, integrate and analyze the data efficiently.
The PhenoGen Informatics website is a comprehensive toolbox for storing, analyzing and integrating microarray data and related genotype and phenotype data. The site is particularly suited for combining QTL and microarray data to search for "candidate" genes contributing to complex traits. In addition, the site allows investigators, if they so desire, to share their data. Investigators can conduct "in-silico" microarray experiments using their own and/or "shared" data.
The PhenoGen website provides access to tools that can be used for high-throughput data storage, analyses and interpretation of the results. Some of the advantages of the architecture of the website are that, in the future, the present set of tools can be adapted for the analyses of any type of high-throughput "omics" data, and that access to new tools, available in the public domain or developed at PhenoGen, can be easily provided.
Connecting genotype to phenotype is fundamental in biomedical research and in our understanding of disease. Phenomics—the large-scale quantitative phenotypic analysis of genotypes on a genome-wide scale—connects automated data generation with the development of novel tools for phenotype data integration, mining and visualization. Our yeast phenomics database PROPHECY is available online. Via phenotyping of 984 heterozygous diploids for all essential genes, the genotypes analysed and presented in PROPHECY have been extended and now include all genes in the yeast genome. Further, phenotypic data from gene overexpression of 574 membrane-spanning proteins has recently been included. To facilitate the interpretation of quantitative phenotypic data we have developed a new phenotype display option, the Comparative Growth Curve Display, where growth curve differences for a large number of mutants compared with the wild type are easily revealed. In addition, PROPHECY now offers a more informative and intuitive first-sight display of its phenotypic data via its new summary page. We have also extended the arsenal of data analysis tools to include dynamic visualization of phenotypes along individual chromosomes. PROPHECY is an initiative to enhance the growing field of phenome bioinformatics.
Using the exome sequencing data from 697 unrelated individuals and their simulated disease phenotypes from Genetic Analysis Workshop 17, we develop and apply a gene-based method to identify the relationship between a gene with multiple rare genetic variants and a phenotype. The method is based on the Mantel test, which assesses the correlation between two distance matrices using a permutation procedure. Using up to 100,000 permutations to estimate the statistical significance in 200 replicate data sets, we found that the method had a 5.1% type I error rate at an α level of 0.05 and had varying power to detect genes with simulated genetic associations. FLT1 and KDR had the most significant correlations with Q1 and were replicated 170 and 24 times, respectively, in 200 simulated data sets using a Bonferroni-corrected p-value of 0.05 as a threshold. These results suggest that the distance correlation method can be used to identify genotype-phenotype associations when multiple rare genetic variants in a gene are involved.
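The Mantel-test machinery described above can be sketched directly. The example below is a generic illustration on simulated data, not the authors' pipeline: a genotype distance matrix (Euclidean distance over rare-variant counts within a gene) is correlated with a phenotype distance matrix, and significance is assessed by permuting individuals in one matrix, rows and columns together.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: n individuals, one gene carrying several rare variants
# (0/1/2 minor-allele counts), and a continuous phenotype.
n, n_variants = 100, 8
genotypes = rng.binomial(2, 0.05, size=(n, n_variants)).astype(float)
phenotype = rng.normal(size=n)

def genotype_dist(g):
    """Euclidean distance between the variant-count vectors of each pair."""
    diff = g[:, None, :] - g[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def phenotype_dist(y):
    """Absolute phenotype difference between each pair of individuals."""
    return np.abs(y[:, None] - y[None, :])

def mantel_r(a, b):
    """Pearson correlation between the upper triangles of two distance matrices."""
    iu = np.triu_indices_from(a, k=1)
    return np.corrcoef(a[iu], b[iu])[0, 1]

dg = genotype_dist(genotypes)
dp = phenotype_dist(phenotype)
obs = mantel_r(dg, dp)

# Permutation: relabel individuals in one matrix (permuting rows and columns
# together) and recompute the correlation to build the null distribution.
n_perm = 999
null = np.empty(n_perm)
for b in range(n_perm):
    idx = rng.permutation(n)
    null[b] = mantel_r(dg[np.ix_(idx, idx)], dp)

p_value = (1 + np.sum(null >= obs)) / (n_perm + 1)
```

Permuting whole rows and columns together (rather than individual cells) is what keeps each permuted matrix a valid distance matrix, so the test respects the dependence among pairs sharing an individual.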