Completion of the human genome project and rapid progress in genetics and bioinformatics have enabled the development of large public databases, which include genetic and genomic data linked to clinical health data. With the massive amount of information available, clinicians and researchers have the unique opportunity to complement and integrate their daily practice with the existing resources to clarify the underlying etiology of complex phenotypes such as allergic diseases. The genome itself is now often utilized as a starting point for many studies and multiple innovative approaches have emerged applying genetic/genomic strategies to key questions in the field of allergy and immunology. There have been several successes, which have uncovered new insights into the biologic underpinnings of allergic disorders. Herein, we will provide an in depth review of genomic approaches to identifying genes and biologic networks involved in allergic diseases. We will discuss genetic and phenotypic variation, statistical approaches for gene discovery, public databases, functional genomics, clinical implications, and the challenges that remain.
Human genome variation encompasses all of the genetic characteristics observed within the human species. Genetic variation occurs both within and among populations and is the basis for natural selection. Insights regarding the distribution of genetic variants among human populations have recently become available1. Interestingly, human genetic diversity decreases in native populations as the migratory distance from Africa increases, presumably due to limitations in human migration2.
Nucleotide diversity is based on single mutations called single nucleotide polymorphisms (SNPs), which occur at a rate of 1 SNP per 1,000 base pairs3. Currently, there are more than 12 million SNPs deposited in GenBank, 6.5 million of which have been validated (http://www.ncbi.nih.gov/SNP). The bulk of variation at the nucleotide level is not visible at the phenotypic level. A better understanding of the basis of genetic diversity was gained with the publication of full sequences of individual genomes4, 5. The Human Genome Project and a parallel project by Celera Genomics yielded two haploid sequences; however, analysis of diploid sequences has revealed that non-SNP variation accounts for much more human genetic variation than single nucleotide diversity. Non-SNP variation includes copy number variation and results from deletions, inversions, insertions and duplications5. Copy number variation regions (CNVRs) have been found in 12% of the genome. CNVRs can be markedly different between populations and contain hundreds of genes, disease loci, functional elements and segmental duplications5. Taking into account this variation as well as SNPs, human to human genetic variation is estimated to be approximately 0.5%. This 0.5% difference amounts to a significant number of distinct genetic traits that uniquely distinguish the genome of every person and contribute to unique and distinct risks for diseases, responses to environmental exposures (including nutrition), and responses to pharmacologic treatment.
Epigenetic variation does not affect the underlying DNA code, but rather modifies how it is expressed through mechanisms including DNA methylation, histone modifications, and microRNAs. It is the structural adaptation of chromosomal regions so as to register, signal or perpetuate altered activity states6. Detailed analysis of methylation across several chromosomes has demonstrated that the promoter regions of nearly 20% of genes are methylated, many of which influence transcription7. Progressive accumulation of phenotypic differences between genetically identical monozygotic (MZ) twins illustrates how pollution, smoking, mold, diet, habits or, in general, environment can shape phenotype and disease susceptibility. MZ twins are epigenetically indistinguishable early in life but, with age, exhibit substantial differences, particularly when they have led different lifestyles and spent less of their lives together8, 9. Therefore, MZ twin discordance for many common disorders could be interpreted as the result of external, environmental factors that modulate susceptibility through a change in the profile of epigenetic modifications that ultimately determine gene function. The field of epigenetics has emerged to explain how cells with the same DNA can differentiate into alternative cell types and how a phenotype can be passed from one cell to its daughter cells. It is now well established that epigenetic mechanisms are important to control the pattern of gene expression during development, the cell cycle, and in response to biological or environmental changes10–13. Unlike genetic alterations, which are permanent and usually affect all cells, epigenetic modifications are cell type specific14. Epigenetic regulation of the immune system occurs at many levels including the differentiation of T cells6, 15–19.
Epigenetic effects on gene expression may persist even after the removal of the inducing agent, and can be passed on, through mitosis, to subsequent cell generations, constituting a heritable, epigenetic change. In a somatic cell, a heritable change can generate a dysfunctional clone of cells with phenotypic consequences (e.g. a tumor). In a germ-line cell, a heritable change may be transmitted to the germ cells themselves (sperm or ova) and potentially to the next generation. In this model, epialleles may be in linkage disequilibrium with SNPs that are genotyped in genome-wide association studies. The role of epigenetics in allergic disease is becoming increasingly evident. One recent study showed that epigenetic reprogramming involving aberrant DNA methylation of a 5′-CpG island in acyl-CoA synthetase long-chain family member 3 (ACSL3) was significantly associated with asthma risk in children born to mothers exposed to air pollutants such as traffic-related combustion emissions20. Another study found that neonates of allergic mothers are born with substantial changes in DNA methylation in their splenic dendritic cells and these dendritic cells show enhanced allergen-presentation activity in vitro21. Current knowledge of epigenetics in allergic diseases is limited and novel applications of epigenetic approaches including genome wide approaches to allergic diseases are necessary to uncover the role of epigenetics.
Phenotype comprises the observable characteristics of an organism, as determined by both genetic makeup and environmental influences, including individual, physical, and psychosocial environmental exposures (Figure 1). Genotype is the descriptor of the genome, which is the set of physical DNA molecules inherited from the organism’s parents, while phenotype is the descriptor of the phenome, the manifest physical properties of the organism including its physiology, morphology and behavior.
Although single gene disorders in classical Mendelian inheritance result in direct genotype-phenotype correspondence, the relationship between genotype and phenotype in traits of multifactorial (complex) inheritance is complicated. In complex diseases with a multifaceted phenotype such as asthma, a given genotype can result in many different phenotypes and there are different genotypes corresponding to a given phenotype. While an individual’s genotype is fairly stable over a lifetime, an individual’s phenotype is dynamic, influenced by both the environment and the underlying genotype, including interactions between them22. The definition, measurement, and validity of phenotyping need to be standardized to increase the quality of research and the reproducibility of genetic studies22. Indeed, NIH recently launched an initiative (PhenX) to address the need for standardized phenotype and environmental exposure measures for cross-study comparison in genetic studies23. These measures do not include information for allergic diseases; however, the National Institute of Allergy and Infectious Diseases recently partnered with the National Heart, Lung, and Blood Institute, the National Institute of Environmental Health Sciences, the National Institute of Child Health and Human Development, the Agency for Healthcare Research and Quality, the Merck Childhood Asthma Network, and the Robert Wood Johnson Foundation to host an Asthma Outcomes Workshop. The objective of this workshop was to develop standardized definitions and data collection methodologies for established and validated asthma outcomes measures. The goal is that these outcomes will be broadly used in NIH-funded studies24.
There are several important variables to consider when defining a phenotype for studies of allergic disorders including disease definition, atopic status, comorbidities, and disease outcomes. For example, severe asthma is a recognized asthma phenotype defined by receiving ongoing treatment with high-dose inhaled corticosteroids, oral corticosteroids, or both for at least 6 months with persistent symptoms or exacerbations when the controller medications are tapered25. However, “severe asthma” is not a single phenotype. Population studies have revealed differences in severe asthma that begins in childhood versus adulthood26–28. Childhood-onset asthma is often “allergic”, while adulthood-onset asthma is more heterogeneous and often is not related to allergy, but rather to other influences including aspirin sensitivity, hormonal influences, and occupational exposures. This heterogeneity strongly supports the need for genetic studies aimed at uncovering the mechanistic bases for each distinct phenotype, rather than the mixed phenotype of asthma.
Age is an important factor in defining phenotypes for allergic disorders. As a population ages, it will be exposed to more environmental factors (e.g. environmental tobacco smoke, diesel exhaust, air pollution) that contribute to the pathogenesis of asthma and allergy, thus increasing sporadic (non-genetic) occurrences of these disorders. Thus, when studying a cohort of adults, there will be a proportion of individuals who could be classified as having asthma because of environmental exposures without a major genetic risk. Studying children, on the other hand, may reduce the heterogeneity of the etiology of asthma because children have had minimal time to accumulate the environmental exposures that increase the risk of asthma. Given the risks of misclassification of asthma in the very young and the heterogeneity in the older groups, serious attention should be focused on the ages of participants. There has been a strong focus on powering genetic studies with very large sample sizes; however, large cohorts may not help improve our understanding of the genetic underpinnings of allergy phenotypes as much as precise phenotyping. Phenotypes can be defined through combinations of clinical information and individual biomarker and molecular data.
The phenotypic definition of controls is another important consideration, especially in studies of allergic disease where some features may overlap. For example, allergic sensitization may overlap with childhood asthma, so if a study aims to identify genes specific to childhood asthma, the control group should include sensitized subjects without asthma. The selection of the controls should be based on the goals of the research. With the availability of genotypic and phenotypic data through public resources such as dbGAP (http://www.ncbi.nlm.nih.gov/gap), it is enticing to consider the recruitment of controls as unnecessary. However, using controls unselected with respect to phenotype increases the number of participants required to achieve the power obtained with controls known not to have the phenotype of interest. This is compounded by the fact that the publicly available controls are likely to be from a different population than the cases. When this situation occurs, researchers should consider applying genetic ancestry matching (discussed below) to minimize population stratification29.
There are three main statistical approaches to gene discovery: linkage, association, and admixture mapping. Linkage analysis tests whether a variant co-segregates with disease in families; association analysis tests whether a genetic variant occurs more often in individuals with disease than in those without disease; and admixture mapping tests whether there are particular regions of the genome at which inheriting DNA from ancestors from a certain region of the world predisposes one to particular diseases. Linkage studies can be performed only in family-based studies, while association testing and admixture mapping can be performed in both population- and family-based studies. These approaches may appear to ask the same questions, but statistically these are independent tests, and the strategy affects the hypotheses that can be tested.
Linkage analysis is based on the assumption that the genetic marker and the disease variant are in close proximity and transmitted intact across generations30. Thus, markers in close proximity to the disease-causing gene segregate with disease in families. However, the resolution of linkage is poor, with candidate regions encompassing hundreds of genes. Thus, linkage analysis identifies only regions, not genes or variants. Further, because linkage provides statistical evidence only, replication is the gold standard to minimize the risk of false positives.
An alternative approach is an association study, which can utilize population- or family-based designs. It is important to recognize that association does not equal causation. Association studies simply measure statistical dependence between two or more variables. Significant associations can be due to one of several misleading factors including linkage disequilibrium (LD), population stratification, or random chance. Once significance is achieved, replication is required to ensure the validity of the finding31.
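As a minimal sketch of the association test described above, the following compares allele counts in cases and controls with a Pearson chi-square test on a 2x2 table (1 degree of freedom). The function name and the allele counts are hypothetical and chosen only for illustration; real studies typically rely on dedicated software and also adjust for stratification and multiple testing.

```python
import math


def allelic_chi_square(case_counts, control_counts):
    """Pearson chi-square test (1 df) on a 2x2 table of allele counts.

    case_counts and control_counts are (risk_allele, other_allele) tuples.
    Returns (chi_square, p_value).
    """
    a, b = case_counts
    c, d = control_counts
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("every row and column must have a nonzero total")
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical allele counts, for illustration only.
chi2, p = allelic_chi_square((60, 40), (40, 60))
print(f"chi-square = {chi2:.2f}, p = {p:.4f}")
```

A small p-value here indicates only statistical dependence between allele and disease status; as noted above, it does not by itself establish causation.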
Admixture occurs when two or more genetically diverse populations merge to form a new population32. Localizing disease genes using an admixed population is called admixture mapping. In human admixture studies, researchers combine information about known population history with information from individuals’ measured genotypes using known ancestry informative markers (AIMs). Studies consistently show that allergic disorders such as asthma are more common in people of West African ancestry compared with people of European ancestry33. The African-American population is an admixed population for which about 20% of the genetic material traces to European ancestry34. The association between increased asthma risk and African ancestry and the admixed nature of the African-American population34 suggests that admixture mapping35 might be an important asthma gene-finding strategy to study genetically heterogeneous populations.
With current technology, it is not cost prohibitive to perform genome-wide linkage and association studies. An advantage of the genome-wide approach is that it requires no a priori evidence and, thus, has the ability to identify regions and variants in genes previously not implicated in allergic disorders and provide insights into the biologic underpinnings of these disorders. Researchers using genome-wide approaches must adjust the level of significance to ensure that findings did not occur by chance; with the increased number of statistical tests, the likelihood of obtaining a p-value below 0.05 by chance alone increases. For the current GWAS SNP chips (density 1M SNPs), significance thresholds of 10^−8 are required31 to control for multiple comparisons. Given this level of significance, the number of samples required to obtain adequate power in a genome-wide association study (GWAS) is in the thousands for a gene with modest effect. By limiting the analysis to those gene regions that have promising a priori evidence of being involved with asthma, the correction for multiple testing becomes much less severe. A candidate gene study examining 1000 SNPs will require only 60.5% of the sample size required by a GWAS examining 1 million SNPs to obtain the same statistical power of 80%. This reduced sample requirement may permit better phenotyping and reduced heterogeneity, which will also improve the power. Thus, there are benefits to both GWAS and candidate gene approaches.
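The sample-size comparison above can be approximated with a short calculation. This is a sketch under stated assumptions, not necessarily the authors' method: it assumes a Bonferroni correction of a family-wise alpha of 0.05, a two-sided test, a normal approximation, and a fixed effect size, so that the required sample size is proportional to (z_alpha/2 + z_beta)^2. The function name is hypothetical.

```python
from statistics import NormalDist


def relative_sample_size(n_tests, alpha=0.05, power=0.80):
    """Quantity proportional to the sample size needed for one two-sided test
    at a Bonferroni-corrected threshold of alpha / n_tests.

    Assumes a normal approximation and a fixed effect size, so that
    n is proportional to (z_{alpha/2} + z_{power})^2.
    """
    standard_normal = NormalDist()
    per_test_alpha = alpha / n_tests
    z_alpha = standard_normal.inv_cdf(1.0 - per_test_alpha / 2.0)
    z_beta = standard_normal.inv_cdf(power)
    return (z_alpha + z_beta) ** 2


candidate = relative_sample_size(1_000)      # candidate gene study, 1000 SNPs
gwas = relative_sample_size(1_000_000)       # GWAS, 1 million SNPs
ratio = candidate / gwas
print(f"candidate study needs about {ratio:.1%} of the GWAS sample size")
```

Under these assumptions the ratio comes out close to the 60.5% figure quoted in the text.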
Because asthma is a prevalent disorder, the classic population-based sampling strategy is case-control. In this approach, the researcher collects individuals with disease (cases) and unrelated individuals without disease (controls). This method is very efficient; compared to a random sampling design, only 35% of the total sample would be required for equivalent power (assuming an asthma frequency of 10%). While this approach appears simple, the challenge is ensuring that the controls come from the same ancestrally homogeneous population as the cases. When cases and controls are not drawn from the same ancestral population, population stratification can result in spurious associations36. For example, suppose most people of African ancestry in a sample had brown eyes and also happened to have asthma, while most people of European ancestry were blue-eyed and asthma-free. A naïve analysis might conclude that the brown-eyes SNP is responsible for asthma, even if eye color and disease are completely unrelated. That is, the methods are likely to nab the wrong SNP suspects, due to “guilt by association”. This problem becomes more pronounced in studies surveying the entire genome because of the huge number of ancestry-related SNPs being tested. To address this genetic-mixing problem, researchers can test whether cases and controls differ over a large number of variants not expected to be associated with disease. If differences exist, adjustments can be made to minimize this effect37. Currently, three fundamentally different methods are used to correct for confounding in allergy genetic association studies37–39. These methods are (1) genomic control, (2) structured association, and (3) principal component analysis. Genomic control uses a set of non-candidate, unlinked loci to estimate an inflation factor, λ, attributable to the population structure present, and then corrects the standard chi-square test statistic for this inflation factor.
The structured association method utilizes Bayesian techniques to assign individuals to “clusters” or subpopulation classes using information from a set of non-candidate, unlinked loci and then tests for an association within each “cluster” or subpopulation class. To control for population confounding by variations in background ancestry during structured association testing (SAT), a panel of ancestry informative markers (AIMs) can be used35. AIMs can therefore also be termed structure informative markers (SIMs). These markers exhibit differences in frequencies between population groups. Importantly, care should be taken in selecting which AIMs to use, as some sets may be population specific40. Principal component analysis (PCA) involves a mathematical procedure that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables called principal components. It can be used to identify and adjust for population substructure37. Family-based association tests offer protection against stratification, a decided advantage of family-based designs41.
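The genomic control correction described above can be sketched as follows. This is an illustrative simplification: it assumes 1-df chi-square statistics, estimates the inflation factor as the median statistic at the null (non-candidate, unlinked) markers divided by the median of a 1-df chi-square distribution, and never deflates statistics below their observed values. The function name and input values are hypothetical.

```python
from statistics import median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_control(null_marker_stats, candidate_stats):
    """Estimate the inflation factor from null-marker chi-square statistics
    and rescale the candidate statistics by it.

    Returns (inflation_factor, corrected_candidate_stats).
    """
    inflation = median(null_marker_stats) / CHI2_1DF_MEDIAN
    # An estimate below 1 suggests no inflation, so no correction is applied.
    inflation = max(1.0, inflation)
    corrected = [stat / inflation for stat in candidate_stats]
    return inflation, corrected


# Hypothetical statistics, for illustration only.
null_stats = [0.2, 0.6, 0.9, 1.4, 3.1]
inflation, corrected = genomic_control(null_stats, [12.0, 4.5])
print(f"inflation factor = {inflation:.2f}, corrected = {corrected}")
```

The median is used here because it is less sensitive than the mean to a few truly associated markers that may be present among the presumed null loci.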
Publicly available databanks now contain billions of nucleotides of sequence data collected from over 260,000 different organisms42. This proliferation of data from genome sequencing over the past decade has resulted in dramatic changes in the way the scientific community communicates and carries out genomic research. Once a genome-wide or candidate gene study has been performed, the investigator can readily obtain information about an identified SNP, including where it is located, its potential functional significance, its frequency in different populations, and what else may already be known (Figure 2). Available public resources are summarized in Table 1. The PUBMED (http://www.ncbi.nlm.nih.gov/sites/entrez) site will provide information on whether a SNP is in a gene and whether there are reported genotypic and allelic frequencies for major population groups. The database of genomic variants (http://projects.tcag.ca/variation/) is also a useful tool. This site permits the researcher to zoom out and get a broader view of the genomic region containing the SNP of interest including features such as newly reported genes, transcripts, and copy number variants. The UCSC Genome Browser (http://genome.ucsc.edu) also provides excellent information about the features of the genome in a particular region. While each of these sites is an excellent tool to examine a small number of SNPs, a large number of SNPs can be investigated efficiently using a high-throughput method, such as the SNP and CNV annotation database (http://genemem.bsd.uchicago.edu/newscan). Once the most promising SNPs have been identified, databases are available to provide estimates of putative functionality of the SNPs. FASTSNP (http://fastsnp.ibms.sinica.edu.tw/pages/input_CandidateGeneSearch.jsp) evaluates all SNPs in a gene region using the methodology proposed by Tabor and colleagues43, 44.
If the SNP is non-synonymous then SNPeffect (http://snpeffect.vib.be/search.php) can provide additional information about the molecular properties of the variant. In order to determine what is already known about a specific SNP or genes in terms of disease associations, the Genetic Association Database (http://geneticassociationdb.nih.gov/) is a useful tool. It is an archive of genetic association studies. It is searchable by both disease and by gene45. A catalog of published GWAS is regularly updated and deposited at http://www.genome.gov/GWAStudies 46. Another resource available is the relationship between SNP variants and gene expression (http://www.scandb.org)47.
To accelerate the identification of common disease alleles, the International HapMap Project in 2002 initiated the construction of a genome-wide SNP database of common variation (http://www.hapmap.org). In brief, phases I and II of the project genotyped over 3 million SNPs in 269 samples from 4 populations: 90 Utah residents (30 parent-offspring trios) with Northern and Western European ancestry (CEU), 45 Han Chinese from Beijing, China (CHB), 44 Japanese from Tokyo, Japan (JPT), and 90 Yorubans (30 trios) from Ibadan, Nigeria (YRI). The average spacing of the map is one SNP per 1000 bp, and this vast resource is currently being used globally as a template for both LD-based candidate gene and genome-wide association studies in allergic disorders. To increase the sample size to over 1000 individuals in 11 populations, HapMap phase III has recently released a draft version of the dataset (http://www.hapmap.org). HapMap genotypic data, allele frequencies, LD data, phase information and sample documentation are publicly and freely available for download from the HapMap website (http://www.hapmap.org).
While whole human genome sequencing is possible48, the cost and the challenges of dealing with such a large quantity of data currently make this approach untenable. However, SNPs that are physically close to one another on the chromosome are more likely to be inherited together than SNPs farther apart. Linkage disequilibrium (LD) is a measure of this non-random correlation between pairs of SNPs. Thus, if a causal variant is in LD with a marker SNP, then the marker will be associated with the phenotype in proportion to the degree of LD between the two. Further, there are blocks of high LD conserved within populations49. The coinheritance of SNP alleles in strong LD enables most of the common genetic variation in a region to be captured by genotyping subsets of SNPs (termed haplotype-tagging SNPs, or tagSNPs) across a candidate gene or region of interest. Because redundant information can be reduced (thus reducing cost), many studies use the tagging SNP approach. A challenge is that tagging SNPs are not selected for their likelihood to be functional. However, recent work has shown that information from unmeasured SNPs can be imputed using tagging SNPs50, 51. Imputation requires use of a reference population in which genotype information is available for a large number of SNPs52. While some of these SNPs would overlap with the genotyped tagging SNPs in a given study, others would be untyped SNPs in LD with the genotyped SNPs. By delineating the genotype patterns in the reference set, researchers can make reasonable inferences about what genotypes are likely to be carried by individuals at untyped SNPs in their study. It is essential that the reference population is similar in ancestry to the population in which imputation will be performed. Fortunately, HapMap53 provides publicly available information on over 3 million SNPs in four major ancestry groups.
Once imputation is performed, the imputed SNPs can be tested for association with disease in the population of interest52. Since imputation interrogates all common variants, the likelihood of identifying biologically relevant associations (e.g., with functional variants) is greater. Another advantage of imputation is that it accommodates replication studies that did not genotype the same SNPs as the original discovery study. With imputation, even studies that have investigated different SNPs can be combined to determine the overall evidence for a given association52.
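The pairwise LD that underlies tagging and imputation can be illustrated with a small calculation. The sketch below assumes phased haplotype counts for two biallelic SNPs (alleles A/a and B/b) and computes the disequilibrium coefficient D and the squared correlation r²; the function name and counts are hypothetical. An r² near 1 means one SNP is a good proxy (tag) for the other.

```python
def linkage_disequilibrium(n_AB, n_Ab, n_aB, n_ab):
    """Compute D and r^2 for two biallelic SNPs from phased haplotype counts.

    n_AB is the number of haplotypes carrying allele A at the first SNP and
    allele B at the second SNP, and so on. Returns (D, r_squared).
    """
    total = n_AB + n_Ab + n_aB + n_ab
    if total == 0:
        raise ValueError("no haplotypes supplied")
    p_AB = n_AB / total
    p_A = (n_AB + n_Ab) / total
    p_B = (n_AB + n_aB) / total
    # D is the excess of the AB haplotype over its expectation under independence.
    d = p_AB - p_A * p_B
    denom = p_A * (1.0 - p_A) * p_B * (1.0 - p_B)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic")
    return d, d * d / denom


# Hypothetical haplotype counts, for illustration only.
d, r2 = linkage_disequilibrium(40, 10, 10, 40)
print(f"D = {d:.3f}, r^2 = {r2:.2f}")
```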
Most genetic studies of common diseases, including GWAS, have focused on common genetic variants on the assumption that common variants are most likely to contribute to common diseases (common disease/common variant hypothesis)54. There is emerging interest in association studies of rare variants, and it is hypothesized that rare variants are more likely to be functional than common variants. Further, recent evidence suggests that rare genetic variants can create synthetic associations that are credited to common variants55. While genetic association and linkage studies are well suited to find common variants for common diseases, they are not optimal for identification of rare variants56. Rare alleles with major phenotypic effects can contribute significantly to common traits in the general population57. Sequencing of candidate genes or entire genomes is the optimal way to identify rare variants. Unfortunately, most current studies are not designed or powered to identify and/or test the contributions of rare SNPs to common disease. Although current approaches are not optimal to elucidate rare variants, they can identify regions of interest that harbor rare variants; these regions can then be further analyzed by deep resequencing (the determination of a new genome sequence relative to a reference genome is often referred to as “resequencing”).
Recently, approaches have been utilized to study the potential health impact of private SNPs, i.e. SNPs that have only been found in a given population58. In one study, investigators explored private SNPs in specific populations that may have phenotypic effects. They found that these SNPs contribute to variability in several cellular processes59. Such variability may provide clues regarding ethnicity-specific responses to diseases or drugs. Another recent study found that in African Americans, private SNPs were associated with asthma60. Investigation of rare and private SNPs requires deep sequencing approaches. The 1000 Genomes Project, a deep-resequencing project aimed at providing detailed genetic variation data on over 1000 genomes from 11 populations around the world, will aid these efforts (www.1000genomes.org). This project will identify over 95% of the variants with allele frequencies of more than 1% in the human genome, substantially enhancing the HapMap data. Results from the 1000 Genomes Project will provide data to allow evaluation of the common disease/common variant (CD/CV) hypothesis versus the common disease/many rare variants (CD/RV) hypothesis61.
Once a genetic study has been performed and allergy-causing variants have been identified, the investigator can gather information to unify the biological function of the gene products. Several groups have reported that genes involved in predisposing to a given polygenic disease tend to share more commonalities (annotated by similar GO terms) in their molecular function or biological pathway than genes chosen at random or genes not involved in the same disease62–69. Gene Ontology (GO, http://www.geneontology.org) can be used to identify commonalities between gene products in the form of an agreed ontology. It provides a controlled vocabulary about genes and gene products based on known or predicted molecular function, cellular location, and biological process70. Because of the existing homologies between proteins among different taxa, GO terms provide researchers with a powerful way to query and analyze functional genomic information in a way that is independent of species70, 71. Once genetic analyses determine which genes (among the thousands analyzed) may be related to the phenotypes, functional genomics experiments allow the scaling of the classical functional experiments to a genomic level72. GO analysis could potentially be used to reduce the number of targets in a large group of correlated genes and to find biological functions potentially affected by multiple genes. In summary, GO annotation terms are enriched among genes linked to the trait, and such commonalities are often sufficient to narrow the list of candidate genes69.
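One common way to quantify whether a GO term is enriched among trait-linked genes is a one-sided hypergeometric test. The sketch below is offered as an illustration of that general idea rather than as the method used in the cited studies; the function name, parameter names, and gene counts are all hypothetical.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, hits):
    """One-sided hypergeometric p-value for GO term enrichment.

    population: number of genes in the background set
    annotated:  number of background genes carrying the GO term
    sample:     number of trait-linked genes examined
    hits:       number of trait-linked genes carrying the GO term
    Returns P(X >= hits) when `sample` genes are drawn at random.
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical counts: 20,000 background genes, 200 annotated with the term,
# 50 trait-linked genes of which 5 carry the term.
p = enrichment_p_value(20_000, 200, 50, 5)
print(f"enrichment p-value = {p:.2e}")
```

In practice many GO terms are tested at once, so p-values from a calculation like this would still need correction for multiple comparisons.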
Both coding and non-coding variability contribute to genetic variation. Novel approaches to capture human genetic variation have integrated global gene expression arrays, DNA sequence variation arrays, and public databases (Figure 3)73. This strategy has been successfully applied to asthma74. In association studies, the investigators found markers on chromosome 17q21 to be reproducibly associated with childhood asthma. They then evaluated the relationships between the markers and transcript levels of genes in cell lines derived from children in the association study. The SNPs associated with childhood asthma were associated with transcript levels of ORMDL3, suggesting that genetic variants regulating ORMDL3 expression are determinants of susceptibility to childhood asthma. Thus, gene expression data informed the genetic data and provided insights regarding the biologic mechanisms that may be involved. Gene expression arrays can also be used in a discovery approach to identify dysregulated genes and pathways. The gene expression profiles can be used to identify key regulatory networks, to identify novel potential candidate genes, and to define phenotypes, which can then serve as quantitative traits for genetic studies. Variation in gene expression is an important mechanism underlying susceptibility to complex disease. An integrated genetic/genomic approach allows the mapping of the genetic factors that underpin individual differences in quantitative levels of expression (expression QTLs; eQTLs)75. The major public data repositories, ArrayExpress and Gene Expression Omnibus (GEO), house raw microarray data and serve as warehouses for processed experimental data, facilitating gene-based queries of multiple expression profiles. ArrayExpress (http://www.ebi.ac.uk/microarray-as/ae) is a public repository for experimental microarray data, queryable based on a range of gene annotations including gene symbols, GO terms and disease associations76.
GEO (http://www.ncbi.nlm.nih.gov/geo) is a public repository that archives and freely distributes microarray, next-generation sequencing, and other forms of high-throughput functional genomic data.
Using a candidate gene approach, common mutations in the filaggrin gene (FLG, 1q21) have been implicated in the causation of ichthyosis vulgaris77–79. Filaggrin80 (filament aggregation protein) is a major epidermal protein involved in maintaining the skin barrier81, and previous studies had demonstrated that filaggrin was absent or reduced in the skin cells of individuals with ichthyosis vulgaris82. Several independent replication studies have now provided convincing evidence of an association of FLG mutations with atopic dermatitis (AD)83–85. The estimated penetrance varies from 42% to 79%86, 87, i.e., between 42% and 79% of individuals with one or more FLG null mutations are likely to develop atopic dermatitis. The discovery that null mutations in FLG are associated with atopic eczema represents the single most significant breakthrough in understanding the genetic basis of this complex disorder. In addition, this association has yielded important insights into the biologic underpinnings of AD and support for the hypothesis that a barrier defect may be a contributory mechanism for the pathogenesis of AD and related atopic disorders83, 88. The exact contribution of FLG to atopic disorders remains to be delineated. The identification of patients with these FLG mutations may facilitate the targeting of novel therapies to repair or replace the defective epidermal barrier89.
Genome-wide association studies have also yielded successes. As discussed above, the association of ORMDL3 with asthma was first identified by GWAS74. Since the initial report, multiple groups have replicated the association between ORMDL3 variants and asthma90–96. Further, these variants have recently been found to associate not only with ORMDL3 expression, but with transcripts of multiple genes in this region92. Increased expression of ORMDL3 has been associated with the unfolded-protein response (UPR)97. There is still much work to be done in this area, but it further illustrates how genetic/genomic approaches can provide insights into novel biologic networks and potential disease mechanisms.
Genetic association studies, including GWAS, have identified hundreds of genetic variants associated with complex human diseases, including 43 replicated genes for asthma98. Most variants identified so far confer relatively small increments in risk and explain only a small proportion of disease heritability. This has led to considerable speculation regarding the sources of the remaining "missing heritability"99. Much of the speculation has focused on the possible contribution of rare variants (minor allele frequency 0.5% – 5.0%). Such variants are not sufficiently frequent to be captured by current genotyping arrays, nor do they carry sufficiently large effect sizes to be detected by current studies. With the completion of the human genome, more focus has gone into dense re-sequencing of regions. As the cost of sequencing is still high, researchers often sequence DNA pools to identify variants that can be explored with additional genotyping100, 101. The pooled samples reliably detect variants at a frequency of 1% or greater with as few as 287 samples100. Further, if overlapping pools are used, these samples can be used to estimate allele frequencies101. Once variants are identified, the next challenge is how to proceed. Much larger samples are needed for the identification of associations with variants than for the detection of the variants themselves. One technique that has been employed is to group rare variants such that the presence of any one of a number of rare variants is examined for disease association. However, this is complicated by the fact that the rare variants may have disparate effects on phenotype, which can make the results of this approach difficult to interpret.
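The grouping strategy described above can be sketched as a simple collapsing test: each individual is scored as a carrier if any rare variant in the group is present, and carrier status is then compared between cases and controls. The sketch below uses a two-sided Fisher exact test as one reasonable choice of test; the genotype data and function names are invented for illustration. Note that collapsing implicitly assumes the grouped variants act in the same direction, which is exactly the limitation noted in the text.

```python
from math import comb


def collapse_carriers(genotypes):
    """Map per-variant rare-allele counts to one carrier flag per individual."""
    return [any(count > 0 for count in person) for person in genotypes]


def burden_table(case_genotypes, control_genotypes):
    """Return (case carriers, case non-carriers, control carriers, control non-carriers)."""
    cases = collapse_carriers(case_genotypes)
    controls = collapse_carriers(control_genotypes)
    a, c = sum(cases), sum(controls)
    return a, len(cases) - a, c, len(controls) - c


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x carriers among the cases.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum all tables at least as unlikely as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


# Invented data: rows are individuals, columns are three rare variants.
cases = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 2, 0],
         [0, 0, 0], [0, 0, 0], [0, 0, 0]]
controls = [[0, 0, 1]] + [[0, 0, 0] for _ in range(7)]

table = burden_table(cases, controls)
p_value = fisher_exact_two_sided(*table)
print(table, round(p_value, 4))
```

With only eight cases and eight controls the test has little power, which mirrors the point that association requires far larger samples than variant discovery.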
Structural variants, including copy number variants (such as insertions and deletions) and copy neutral variation (such as inversions and translocations), may account for some of the unexplained heritability102. While variation affecting large chromosomal regions can result in large phenotypic perturbations, small or regional copy number variation can have minimal to severe effects on phenotype103. In 2006, the first comprehensive CNV map of the human genome was published104. Since then, CNVs have been associated with many different diseases, including asthma105. A key challenge for copy number variants is detection102. In a recent study, two copy number algorithms showed poor agreement106. Thus, while CNV analysis offers promise, the technical and statistical assessment of CNVs is still evolving107, 108.
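Agreement between two CNV calling algorithms, as discussed above, can be quantified in several ways. One commonly used criterion is reciprocal overlap: two calls match only if their shared segment covers at least a given fraction of each call. The sketch below assumes a 50% threshold and half-open (chromosome, start, end) intervals; these choices, the coordinates, and the function names are ours and do not describe how the cited study measured agreement.

```python
def reciprocal_overlap(a, b):
    """Smaller of the two overlap fractions for intervals (chrom, start, end)."""
    if a[0] != b[0]:
        return 0.0
    overlap = min(a[2], b[2]) - max(a[1], b[1])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a[2] - a[1]), overlap / (b[2] - b[1]))


def concordance(calls_a, calls_b, threshold=0.5):
    """Fraction of calls in calls_a matched by at least one call in calls_b."""
    if not calls_a:
        return 0.0
    matched = sum(
        1 for a in calls_a
        if any(reciprocal_overlap(a, b) >= threshold for b in calls_b)
    )
    return matched / len(calls_a)


# Invented call sets from two hypothetical algorithms.
calls_a = [("chr1", 100, 200), ("chr1", 500, 900), ("chr2", 50, 150)]
calls_b = [("chr1", 120, 210), ("chr1", 850, 1500), ("chr3", 50, 150)]

print(f"A matched by B: {concordance(calls_a, calls_b):.2f}")
print(f"B matched by A: {concordance(calls_b, calls_a):.2f}")
```

Concordance is reported in both directions because the measure is not symmetric when the two call sets differ in size.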
The modest size of genetic effects detected thus far confirms the multifactorial etiology of these complex disorders. The next frontier of genetic studies will require innovative approaches to look for the sources of missing heritability. This will include application of whole genome sequencing to people with extreme phenotypes, use of expanded genome variation data provided by the 1000 Genomes Project, development of novel methods to detect additional sources of variation, improved phenotyping and use of eQTLs, expanded efforts in epigenetics and identification of epigenetic variation, rigorous assessment of environmental influences and gene-environment interactions, assessment of gene-gene interactions, and the design of meta-studies with well-defined, consistent phenotypes spanning large population sets.
This work was supported by National Institutes of Health grants U19A170235 (GKKH and LJM) and P30HL10133 (TMB).