In the “good old days” of neuropsychiatric genetics research - two years ago - the literature was bustling with reports suggesting association of “candidate genes” with a range of phenotypes. Some of the phenotypes were traditional diagnostic categories, while others were “endophenotypes”, including cognitive, electrophysiological, or neuroimaging measures. Now that we have reports from the first generation of genome-wide association studies (GWAS), the landscape is markedly different. Together with some disappointing failures to replicate prior candidate gene studies, the GWAS have prompted reconsideration of, and skepticism toward, findings that are not significant at a genome-wide level. To consider findings “significant” today, suggested probability levels range from p
(for more specific guidelines, see Freimer and Sabatti, 2004, Freimer and Sabatti, 2005). Some suggest that we currently possess no strong candidate genetic loci for neuropsychiatry research (Flint and Munafo, 2007). Meanwhile, recent GWAS outside neuropsychiatry are providing novel leads at a rate that strains human capacity for comprehension, and are prompting critical re-evaluation of basic strategies for discovery in biomedicine (Frayling, 2007).
Initial GWAS results already make two critical points for neuropsychiatry research: (1) if we stick with conventional diagnostic categories as phenotypes, we are going to need very large samples to detect very small effects; (2) even if we are successful in defining “endophenotypes” or intermediate phenotypes, it remains unclear that these will possess “simpler genetic architecture.” Given the large number and scale of GWAS now targeting neuropsychiatric phenotypes, it seems inevitable that we will soon possess a large number of leads - both genetic and phenotypic - that will require follow-up. If there are hundreds, or perhaps thousands of genetic variants involved (not to mention their interactions with both other genetic variants and environmental effects), and myriad phenotypes to consider, how will we prioritize leads for mechanistic research that will inform theories of pathophysiology and the development of rational treatments?
Bioinformatics strategies are helping develop a “bottom-up” scaffold enabling researchers to make the initial steps upwards from the human genome to the human proteome, and already can help constrain hypotheses with knowledge of the biochemical signaling pathways that have been associated with key genomic variants. But the ultimate vision of connecting this basic biological knowledge to the “higher” and more complex phenotypes comprising neural systems functions, cognitive abilities, neuropsychiatric symptoms and syndromes, will require more systematic and formal descriptions of these entities and their relations to each other and their putative biological underpinnings. Much as the Gene Ontologies project has succeeded in advancing genomics research, we suggest that cognitive ontologies can help advance research in neuropsychiatric phenomics - the systematic study of neuropsychiatric phenotypes on a genome-wide scale.
New tools to manage complexity can help increase our odds of making bets with solid payoffs in biomedical discovery. We may soon be poised to overcome current obstacles in genotyping, including detection of copy number variations and rare variants; even epigenome-wide arrays may be available. But we are likely still to confront significant obstacles in converting large lists of genetic variants into tractable biological research programs that will identify mechanisms by which these genes exert their effects on complex systems and syndromal levels. Within the Consortium for Neuropsychiatric Phenomics at UCLA (www.phenomics.ucla.edu), we have used a simple schematic scaffold for translational neuropsychiatric research from genome to syndrome, using seven levels (see figure). Seven levels were selected not because we believe this accurately reflects the vast terrain connecting genome to syndrome, but rather because humans generally have difficulty maintaining in mind or discerning a larger number of categories.
Simplified schematic of multileveled “-omics” domains for cognitive neuropsychiatry.
Even given this dramatic simplification, it is easy to see that relating genome to syndrome with rational mechanistic hypotheses is a vast task. Kendler graphically highlighted that genotype-to-phenotype relations are “many-to-many” (i.e., any given genotype influences multiple phenotypes (pleiotropy), and a given phenotype has multiple genotypic contributions) (Kendler, 2005). But these many-to-many relations exist between each level as we proceed from genome to syndrome (i.e., a single protein may affect multiple cellular systems, and many proteins are required in any one cellular system; a single kind of cellular system has ramifications in multiple neural systems, and each neural system depends on manifold cellular systems; and so on). The immediate conclusion is that there can be exponential expansion of effects as one follows paths up the hierarchy (pleiotropic expansion) and that any given phenotype may be affected by very large numbers of genes (polygenic expansion).
Simple back-of-the-napkin calculations help put in perspective the scope of the challenges. Starting with the simplest assumption that there is a single gene with syndromal manifestations, and conservative estimates of pleiotropy at each level (i.e., that a single variant may affect p number of proteins, that each protein affects c number of cellular systems or signaling pathways, and so on), the influence of any single gene explodes through the higher phenomic levels. For example, a single gene with 5-fold expansion at each of the six transitions between the 7 levels of phenomic expression will yield 15,625 (5^6) effects. If the expansion were 10-fold at each level, there would be a million (10^6) syndromal variants for each genetic variant.
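The arithmetic behind these figures can be checked with a short sketch. It assumes, as the numbers in the text imply, that a 7-level hierarchy has six transitions and that the same fold expansion applies at each one; the function name `expansion_count` is ours and purely illustrative.

```python
def expansion_count(fold: int, levels: int = 7) -> int:
    """Count the top-level effects traced from a single genetic variant.

    Assumption: a hierarchy with `levels` levels has `levels - 1`
    transitions, and every transition multiplies the number of
    downstream effects by the same factor `fold`.
    """
    if fold < 1 or levels < 1:
        raise ValueError("fold and levels must be positive integers")
    return fold ** (levels - 1)


if __name__ == "__main__":
    for fold in (5, 10):
        print(f"{fold}-fold expansion: {expansion_count(fold):,} effects")
```

With a uniform 5-fold expansion this reproduces the 15,625 effects quoted above, and with a 10-fold expansion it gives one million. Real expansion factors would differ from level to level, so the uniform factor is a simplification.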
This expansion logic also applies to top-down analyses of polygenic expansion. Imagine we seek genetic contributions to a single syndrome. Assuming the syndrome is defined by some number s of symptoms, and that each symptom has p cognitive underpinnings, and so forth, it is easy to appreciate that thousands of genes may contribute to complex phenotypes, and indeed that some brain-related phenotypes may be affected by substantial portions of the entire genome.
Similar conclusions are reached by computing the shared variance between levels in a 7-level hierarchy, where the variance shared between gene and syndrome is the product of the shared variances across the six transitions between adjacent levels. For example, say we are interested in finding a gene that explains 25% of the variance in a complex syndromal phenotype. This demands that the average shared variance between levels is 80% (0.8^6 ≈ 26%), or in other words that each level must correlate approximately .9 with the next level. A more conservative but still optimistic estimate of 50% shared variance between levels (still demanding a correlation of .7, which is not far from the upward limits of reliability for psychiatric diagnosis, symptom rating scales, and cognitive phenotype measurements), yields shared variance between gene and syndrome of 1.6%. This might be considered the best-case, possibly realistic scenario. A more realistic scenario is that in which approximately 20% of variance is shared across levels. This is more consistent so far with the typical correlations and effect sizes seen in psychiatry research, where correlations of cognitive measures to symptoms, or brain imaging parameters to cognitive dimensions, are not infrequently in the range of .4 to .5. In this scenario the shared variance of a genetic variant with a complex syndromal variant is on the order of .01% (0.2^6 ≈ .006%). A simple additive genetic model would thus require some 5000 genetic variants, each explaining .01% of variance, to explain a heritability of 50%, which is in the range of that observed for many cognitive, personality, or diagnostic phenotypes. The idea that this last scenario is more realistic than either of the former ones is supported by some reviews suggesting that genetic variation may explain only ~20% of variation at the level of the transcript, and that less than 2.5% of variance was shared with “higher” phenotypes, regardless of their putative proximity to the genetic level (Flint and Munafo, 2007). This line of reasoning calls into question the notion that gene discovery will be advanced substantially by the study of any particular “endophenotype”, given that so far these have not often demonstrated a much more robust genetic signal, or “simpler genetic architecture.”
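The shared-variance scenarios above can be reproduced with a minimal sketch. It assumes that shared variance compounds multiplicatively across the six transitions of a 7-level hierarchy, which is our reading of the figures in the text, and that variants contribute additively and independently. The function names `gene_to_syndrome_variance` and `variants_needed` are hypothetical.

```python
import math


def gene_to_syndrome_variance(shared: float, levels: int = 7) -> float:
    """Compound a per-transition shared variance across the hierarchy.

    Assumption: the same proportion `shared` (0..1) of variance is shared
    at each of the `levels - 1` transitions, and the proportions multiply.
    """
    if not 0.0 <= shared <= 1.0:
        raise ValueError("shared must be a proportion between 0 and 1")
    return shared ** (levels - 1)


def variants_needed(heritability: float, per_variant: float) -> int:
    """Variants required under a purely additive, independent model."""
    if per_variant <= 0.0:
        raise ValueError("per_variant must be positive")
    return round(heritability / per_variant)


if __name__ == "__main__":
    for shared in (0.8, 0.5, 0.2):
        r = math.sqrt(shared)  # correlation implied between adjacent levels
        total = gene_to_syndrome_variance(shared)
        print(f"shared={shared:.0%}  r={r:.2f}  gene-to-syndrome={total:.4%}")
    # The text rounds the 20% scenario to .01% of variance per variant.
    print(variants_needed(0.50, 0.0001), "variants for 50% heritability")
```

The three scenarios come out at roughly 26%, 1.6%, and 0.006% of variance shared between gene and syndrome. The 5000-variant figure follows from rounding the last value up to .01%; using the unrounded product would raise the count to nearly 8000, so the estimate in the text is, if anything, on the low side.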
Two basic strategies have been proposed to overcome these daunting challenges. Plan A, which we might label the “massively univariate” approach, is to increase sample sizes to detect the modest effects that appear most likely to characterize associations between genes and high-level phenotypes. It remains unclear how large these samples will need to be before we will find genetic variants associated with syndromal phenotypes like schizophrenia or bipolar disorder, but recent evidence suggests that such samples will likely exceed 10,000, and perhaps 50,000, before variants with robust genome-wide significance are observed. Two recent GWAS focusing on schizophrenia and bipolar disorder phenotypes, examining sample sizes exceeding 10,000 individuals, have reported a handful of strong targets (Ferreira et al., 2008, O’Donovan et al., 2008). Assuming that results from these studies stand up to replication, it is conspicuous that there is virtually no overlap with the long lists of candidate genes that have been suggested so far for these syndromes.
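To give a rough sense of how sample size scales with effect size under Plan A, the sketch below uses a common large-sample approximation for a one-degree-of-freedom association test on a quantitative trait. Everything in it is an illustrative assumption rather than a calculation taken from the studies cited: the function name `approx_sample_size`, the significance threshold of 5e-8, the 80% power target, and the quantitative-trait model itself (case-control designs for diagnostic phenotypes would need a different calculation).

```python
import math
from statistics import NormalDist


def approx_sample_size(variance_explained: float,
                       alpha: float = 5e-8,
                       power: float = 0.80) -> int:
    """Approximate N needed to detect a variant in a 1-df association test.

    Assumptions (not from the source text): a quantitative trait, a
    two-sided test at level `alpha`, and the large-sample approximation
    NCP = N * q / (1 - q), where q is the proportion of trait variance
    explained by the variant.
    """
    if not 0.0 < variance_explained < 1.0:
        raise ValueError("variance_explained must be strictly between 0 and 1")
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1.0 - alpha / 2.0)  # significance quantile
    z_power = std_normal.inv_cdf(power)              # power quantile
    required_ncp = (z_alpha + z_power) ** 2
    q = variance_explained
    return math.ceil(required_ncp * (1.0 - q) / q)


if __name__ == "__main__":
    for q in (0.01, 0.001, 0.0001):
        print(f"variance explained {q:.2%}: N ~ {approx_sample_size(q):,}")
```

Under these assumptions the required sample grows almost in inverse proportion to the variance explained, reaching the tens of thousands for a variant explaining 0.1% of variance. That is broadly in line with the sample sizes discussed above, though the figures should be read as orders of magnitude only.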
Plan B, which might be labeled the “multivariate” approach, or the “phenomics” approach, is to develop strategies to increase the magnitude of variance shared between genotype and phenotype through clever redefinition of genotype or phenotype, and the paths that relate genotype to phenotype. The cardinal premise of the “endophenotype” strategy is that there exist phenotypes “closer to the gene” that will share more variance with real gene effects. While the Flint and Munafo survey does not offer much cause for optimism, it should be recognized that this strategy has seldom been employed in GWAS so far, and that the potential increases in power of this strategy remain largely unexplored. Even when multiple phenotypes have been examined, these are often derived from a single level of analysis (e.g., multiple partially overlapping diagnostic schemata) that is far removed from putative biological substrates.
While not yet widely exploited for GWAS, some work with candidate genes has suggested that effect sizes may be considerably larger when examining neural system phenotypes (e.g., functional MRI measurements) relative to diagnostic or behavioral phenotypes (Egan et al., 2001, Hariri et al., 2003). It is also possible that more specific neurocognitive phenotypes may have strong relations with individual gene effects, by virtue of being more closely related to the physiological processes actually impacted by the genes. For example, traditional measures of “executive function” such as the Wisconsin Card Sorting Test perseverative error score shared less than 5% variance with the COMT Val158Met polymorphism, while more specific measures of cognitive set-shifting shared up to 40% variance with genotype (Bilder et al., 2002, Bilder et al., 2004, Nolan et al., 2004). Such high estimates might be inflated by chance given application of the candidate gene approach with small sample sizes, but they highlight the possibility that refined phenotyping may yield greater promise not only by virtue of increasing statistical power, but further by enhancing insight into plausible mechanisms.
Even these theoretically more refined phenotypes are still derived from a single level of analysis, and it remains unclear what advantages might be obtained by defining new phenotypes that span different levels of investigation. For example, rather than identifying a new and improved neuroimaging or cognitive phenotype, we might find both more power and greater mechanistic insight from combining a historical phenotype, an imaging phenotype, a cognitive phenotype, and a symptom phenotype. Perhaps a stronger genetic association might be found for individuals with poor premorbid social function, gray matter volume reduction, poor working memory, and negative symptoms, than could be found for any one of these alone. This may seem a counter-intuitive strategy from the experimental perspective, but it certainly has parallels in other disease areas. For instance, the diagnosis of cardiac valve dysfunction benefits greatly from combining history (e.g., of early rheumatic fever), with laboratory results (e.g., from EKG), and behavioral symptoms (e.g., shortness of breath on exertion).
The definition of novel multi-level, multivariate phenotypes may also benefit from advances in complexity theory, and confer substantially greater traction on what might seem to be insuperable obstacles. Particularly given the number of emergent properties putatively implicated in the traversal of biological levels from the genome to syndrome, our ability to identify convergences and self-organizing principles may help constrain the explosive expansion of possibilities to more manageable subsets. Thus, rather than attempting to explain the high heritability (perhaps 80%) of a syndrome such as schizophrenia by the additive effects of some 800 genetic variations each independently contributing 0.1% to phenotypic variance, there is hope that a smaller number of genetic variants might be identified that interact and converge on a more modest number of critical biological pathways. Stuart Kauffman has written eloquently about the “order for free” that characterizes large-scale networks, and applied these principles to problems as diverse as the origins of life and economics (Kauffman, 1995). Similar methods have been used to help understand self-organization in neural networks specifically, and biological networks more generally (Tononi et al., 1994, Sporns et al., 2005). Progress is already being made identifying the redundancy and degeneracy in gene networks and other biological systems using techniques that integrate information sciences and systems biology (Sridhar et al., 2007, Zheng et al., 2007, Centler et al., 2008). Application of similar methods to multi-level modeling of genome-to-syndrome hypotheses comprises a reasonable if not simple extension of these theoretical lines. Regardless of the specific methods that will be used, it is clear that novel strategies to effectively manage complexity of large-scale networks spanning different levels of investigation will be critical to advance the emerging discipline of phenomics.