Phenotype mining is a novel approach for elucidating the genetic basis of complex phenotypic variation. It involves a search of rich phenotype databases for measures correlated with genetic variation, as identified in genome-wide genotyping or sequencing studies. An initial implementation of phenotype mining in a prospective unselected population cohort, the Northern Finland 1966 Birth Cohort (NFBC1966), identifies neurodevelopment-related traits—intellectual deficits, poor school performance and hearing abnormalities—which are more frequent among individuals with large (>500 kb) deletions than among other cohort members. Observation of extensive shared single nucleotide polymorphism haplotypes around deletions suggests an opportunity to expand phenotype mining from cohort samples to the populations from which they derive.
Genetic association studies usually start with a predefined phenotype and then search for genotypes that at least partially account for the sharing of that phenotype among members of a study sample. Yet many commonly studied classes of phenotypes, such as most psychiatric disorders, are imprecisely defined and difficult to assess objectively, limiting the accuracy with which they can be designated as shared between individuals in a study sample, contributing to disappointing results in genome-wide association analyses. The designation of genotypes, by contrast, is essentially objective and precise. The widespread availability of study samples genotyped at high resolution across the genome therefore opens up an alternative strategy for association analysis: a search for phenotypic features which are over-represented among individuals possessing particular genotypes or classes of genotypes. This search, which we term phenotype mining, requires study samples for which extensive phenotype data are available, as is exemplified by several prospective population cohorts.
The implementation of phenotype mining strategies requires consideration of several methodological issues. Most importantly, when evaluating the significance of associations detected between genotypes and multiple possible phenotypes, one needs to account for the extensive search that is intrinsic to this approach. While the genetic space is well defined, we cannot describe generically the total number of possible phenotypes. For example, in a longitudinal cohort study, individuals may be assessed periodically over several decades and a vast quantity of data collected. The types of data typically vary enormously, both in their suitability as phenotypes for genetic analysis (whether they represent trait rather than state measures and whether they are heritable) and in their degree of dependence on each other. It is therefore not possible to define, extrapolating from the specifics of a given study, a general threshold for statistical significance. Instead, in the context of each investigation, the total number of available and investigated phenotypes will need to be carefully described and appropriate corrections for multiple comparisons adopted. In the present study, where our goal is mainly to illustrate the potential of phenotype mining, we have adopted a highly simplified design, in which the search space in both the genetic and phenotypic domains is substantially restricted, in a carefully selected study sample.
Genotypic variants with a hypothesized functional effect provide an obvious starting point for phenotype mining studies. It is not yet feasible to investigate the phenotypic correlates in large data sets of comprehensive collections of genome-wide functional variants, although the increasing availability of exome and even whole genome sequencing may enable such investigations in the near future. Therefore, rather than attempting to examine all functional variants currently known across the genome, we focus on large copy number variants (CNVs), which we expect, as a class, to be responsible for sizable phenotypic effects, and which are readily detected using standard genotyping arrays. Similarly, rather than considering all of the possible traits assessed in the genotyped study sample that we have examined, we restrict the phenotypic space to all of the readily identifiable measures available in this study sample that we could reasonably classify as neurodevelopmental phenotypes. We chose this restriction based on numerous recent studies showing CNV associations for neurodevelopmental disorders, such as developmental delay, autism spectrum disorders (ASD), schizophrenia and epilepsy (1–8), reasoning that in general such CNVs likely play a role in the genetic underpinning of this class of phenotypes.
We report here our results in applying a phenotype mining approach to large (>500 kb) CNVs, identified in a genome-wide single nucleotide polymorphism (SNP) genotyping study of 4932 individuals drawn from an unselected prospective birth cohort, the Northern Finland 1966 Birth Cohort (NFBC1966), consisting of all individuals born in the northernmost provinces of Finland in 1966 (9,10). Routine follow ups of NFBC1966 subjects have generated a longitudinal phenotype database that includes information from official registers, hospital records, questionnaires and clinical examinations of the participants. The NFBC1966 study sample, in addition to having been assessed, over several decades, for a wide range of phenotypic measures, derives from a population which has been exceptionally well characterized genetically (11,12). Using the large CNVs detected genome wide as the starting point, we searched the NFBC1966 database for phenotypic differences, in seven pre-selected traits hypothesized to relate to brain development, between individuals carrying CNVs longer than 500 kb, and individuals not carrying such CNVs.
The genome scan of NFBC1966 was conducted on genomic DNA extracted from peripheral blood samples using an Illumina Infinium 370cnvDuo array, which includes 317 000 SNP markers and an additional 55 000 probes designed specifically for regions known to vary in copy number. The scan yielded 634 confirmed autosomal CNV calls of over 500 kb. The confirmed CNVs had an average SNP density of 0.11 SNP/kb, ranging from one SNP per 2.4 kb to one SNP per 25 kb, supporting the assumption that the CNVs detected predominantly consist of unique sequence. The average SNP density did not differ between deletions and duplications (t = 1.025, P = 0.307). The CNVs detected had an average size of 991 kb and spanned 165 distinct CNV regions (CNVRs) (Supplementary Material, Table S1), which were relatively evenly distributed among chromosomes (Supplementary Material, Fig. S1), and in total encompassed ~7% of the genome (198 Mb). This proportion is roughly comparable with that observed in previous surveys of large CNVs (13,14).
Although the majority (65%) of the 165 CNVRs contained only a single CNV, the remainder included partially overlapping CNVs, with 7% of CNVRs containing 10 or more CNVs and 11.5% of CNVRs containing reciprocal CNVs (both deletions and duplications). Of the 165 CNVRs, 113 (68.5%) had over 50% overlap with at least one CNV of identical call in the Database of Genomic Variants (http://projects.tcag.ca/variation/) or DECIPHER (http://decipher.sanger.ac.uk/). In addition, 18 (10.9%) of the 165 CNVRs overlapped with a known micro-deletion/-duplication syndrome in DECIPHER. The 634 CNVs were observed in 529 of the 4932 genotyped individuals, most of whom (83%) displayed only a single CNV (Table 1), and they were similarly prevalent in males and females (P = 0.69). Of the 634 CNVs, only 28% were deletions; the significant underrepresentation of deletions compared with duplications (P = 8.6 × 10⁻²⁸) is concordant with observations from previous studies of large CNVs (13,15). No autosomal homozygous deletions of >500 kb were detected. We cannot rule out the presence of multiplications (i.e. duplications of over three copies) in the data set, as these are difficult to differentiate from simple duplications using the 370cnvDuo array data employed in this analysis. In addition to large CNVs, five individuals with trisomy 21 and three with trisomy X were identified. These individuals were not carriers of large CNVs and were not included in the analyses of phenotypes.
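The grouping of individual CNV calls into CNVRs can be illustrated with a minimal sketch. It assumes that a CNVR is the union of calls that overlap on the same chromosome, which is a common convention but is not spelled out in the text; the function name and the coordinates in the usage example are hypothetical.

```python
from typing import List, Tuple

# A CNV call as (chromosome, start_bp, end_bp); this layout is an assumption.
CNV = Tuple[str, int, int]


def merge_into_regions(cnvs: List[CNV]) -> List[Tuple[str, int, int, int]]:
    """Merge overlapping CNV calls into regions.

    Returns one (chromosome, start, end, n_calls) tuple per region.
    """
    regions: List[Tuple[str, int, int, int]] = []
    # Sorting by chromosome and start lets us merge in a single pass.
    for chrom, start, end in sorted(cnvs):
        if regions and regions[-1][0] == chrom and start <= regions[-1][2]:
            last = regions[-1]
            # Extend the current region and count the additional call.
            regions[-1] = (chrom, last[1], max(last[2], end), last[3] + 1)
        else:
            regions.append((chrom, start, end, 1))
    return regions


# Hypothetical coordinates, for illustration only.
example_calls = [
    ("15", 100, 700),
    ("15", 600, 1300),
    ("2", 50, 600),
    ("15", 2000, 2600),
]
example_regions = merge_into_regions(example_calls)
```

Counting the calls per region in this way is what allows statements such as the share of CNVRs that contain only a single CNV.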
We constrained our initial mining of the NFBC1966 database to seven phenotypes (Table 2) that we hypothesized to be related to neurodevelopment and therefore potentially correlated with CNV carrier status: mental sub-normality, defined as intelligence quotient (IQ) < 85 (16), a standardized measure of poor school performance (17), psychosis (18–21), epilepsy, neonatal convulsions (22–24), cerebral palsy or perinatal brain damage (23–25) and impaired hearing (26,27). We evaluated whether any of these phenotypes were overrepresented among 160 carriers of large deletions compared with 4381 individuals without an evident CNV. We adopted a conservative uncorrected significance threshold of 0.0024 which corresponds to a significance level of 0.05 when adjusting for 21 tests (seven phenotypes investigated with three comparisons: non-carriers versus (i) carriers of deletions, (ii) carriers of duplications and (iii) carriers of both deletions and duplications combined) using a Bonferroni correction.
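The threshold arithmetic can be checked with a short sketch. The function name is ours; the inputs (a family-wise level of 0.05, seven phenotypes, three comparisons, and the 44 developmental measures used in the follow-up analysis) are taken from the text.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


# Main analysis: seven phenotypes, each examined in three comparisons.
main_threshold = bonferroni_threshold(0.05, 7 * 3)
print(f"{main_threshold:.4f}")  # 0.0024

# Follow-up analysis of early development: 44 tests in total.
followup_threshold = bonferroni_threshold(0.05, 44)
print(f"{followup_threshold:.5f}")  # 0.00114
```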
We observed a higher frequency of IQ < 85 among deletion carriers (5.0%) compared with non-carriers (1.4%) (P = 0.0024, odds ratio (OR) = 3.79, 95% confidence interval (CI): 1.54–8.14) (Table 3). School performance may reflect subtle intellectual and/or behavioral deficits and is measured in a standardized way throughout Finland (17); students who do not fulfill a minimum set of criteria must repeat a grade. Information on grade in school was collected from all NFBC1966 participants at 14 years old. Of the deletion carriers, 10% had repeated a school grade at some point during their school career compared with 3.9% of non-carriers (P = 0.00088, OR = 2.70, 95% CI: 1.47–4.67) (Table 3). Low IQ is correlated with repeating a school grade (r² = 0.65, Table 2). However, even when excluding individuals with IQ < 85, repeated years in school remained twice as common (5.3%) among deletion carriers compared with non-carriers (2.7%) (P = 0.071, OR = 2.02, 95% CI: 0.84–4.24) (Table 3). We also found impaired hearing to be more common among deletion carriers (8.1% in carriers versus 3.1% in non-carriers, P = 0.002, OR = 2.72, 95% CI: 1.38–4.95) (Table 3).
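For readers who want to see how a carrier versus non-carrier comparison of this kind can be computed, the following is a minimal standard-library sketch of a two-sided Fisher exact test with a sample odds ratio. The function name is ours, and the counts in the usage example are hypothetical values reconstructed from the rounded percentages above, so they need not reproduce the published P value, OR or confidence interval.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int):
    """Two-sided Fisher exact test for the 2x2 table [[a, b], [c, d]].

    Rows are groups (e.g. carriers, non-carriers); columns are
    phenotype present / absent. Returns (sample odds ratio, p-value).
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(k: int) -> float:
        # Hypergeometric probability of k affected individuals in row 1.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables no more likely than the observed one.
    p_value = sum(
        p for p in (prob(k) for k in range(lo, hi + 1)) if p <= p_obs * (1 + 1e-7)
    )
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return odds_ratio, min(1.0, p_value)


# Hypothetical counts: 8 of 160 carriers and 61 of 4381 non-carriers affected.
example_or, example_p = fisher_exact_two_sided(8, 152, 61, 4320)
```

The sample odds ratio (a·d)/(b·c) is used here for simplicity; statistical packages often report a conditional maximum likelihood estimate instead, which can differ slightly.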
We evaluated the co-occurrence of these three overrepresented phenotypes in specific individuals and with respect to specific deletions (Table 4). Of the 160 deletion carriers, 24 possessed at least one of these phenotypes, 10 at least two phenotypes and three possessed all three phenotypes. We identified 18 distinct deletions (Supplementary Material, Fig. S2) in this set of 24 individuals, all of whom carried only one deletion. For the majority of these distinct deletions (11 of 18), the deletion carriers possessed at least two of the three overrepresented phenotypes. We also observed that, among these 24 individuals, five also possessed one or more of the other neurodevelopmental phenotypes that we investigated (Table 4).
To determine whether the phenotypic associations with the 18 CNVRs were specific to deletions, we then evaluated the seven neurodevelopmental phenotypes in relation to reciprocal duplications in these regions. We identified such duplications in 7 of the 18 CNVRs; a total of 86 NFBC1966 members carried a duplication in one of these seven CNVRs, none of whom also carried a deletion in these regions. We compared the frequency of each of the seven neurodevelopmental phenotypes separately between the 86 duplication carriers and the NFBC1966 members who have no CNVs; only impaired hearing displayed a nominally higher frequency among individuals with reciprocal duplications compared with individuals not carrying large CNVs (P = 0.04, OR = 2.55, 95% CI: 0.89–5.94) (Supplementary Material, Table S2).
We observed that 82% of CNVRs overlapped either partially or completely with one or more RefSeq genes, a finding consistent with previous estimates (13). This proportion did not differ between CNVRs containing and those not containing deletions (79 versus 83%, respectively; χ² = 0.68, P = 0.41). Of the 18 deletions correlated with one or more neurodevelopmental traits, 15 (83%) overlapped with one or more genes. A similar percentage (85.4%) of the remaining 48 deletions not correlated with the neurodevelopmental traits overlapped with one or more genes.
After conducting the analyses of seven phenotypes from the NFBC1966 database, we obtained additional data sets containing information on early development, which we then used to carry out follow-up studies of the relationship between deletion status and developmental measures. These additional data covered the participants' first year of life (Supplementary Material, Table S3) and the period when they were between 1 and 6 years of age (Supplementary Material, Table S4), and were derived, respectively, from parent interviews and from health records of the participants. The deletion carriers and the individuals without large CNVs did not differ, at even nominal levels of significance, in the mean age at which they achieved any of the 10 developmental milestones assessed before age 1 year (Supplementary Material, Table S3). In contrast, the assessments made between 1 and 6 years of age suggest a higher degree of developmental difficulties in the deletion carriers than in the individuals without large CNVs. In particular (Supplementary Material, Table S4), the deletion carriers were less likely to know their own name at age 3 (7.70 compared with 19.83%, uncorrected P = 0.03, OR = 0.33, 95% CI: 0.09–0.93) and more likely to have abnormal hearing at age 4 (24.44 compared with 8.37%, uncorrected P = 0.0012, OR = 3.54, 95% CI: 2.58–7.24), although none of these differences remained significant after correcting for multiple comparisons (Bonferroni correction for 44 tests in total). The nominal association between carrying large deletions and impaired hearing at age 4 remains of similar magnitude after removing the 24 subjects driving the associations in Table 3, suggesting that a relationship between the presence of a deletion and impaired hearing at an early age is not limited to these subjects (data not shown).
Genetic drift in recently expanded population isolates such as Northern Finland can increase the frequency of otherwise rare functional variants and the chromosomal segments which contain them. Identification of such segments shared identical by descent (IBD) among members of a cohort sample may provide a means to extend the search for shared phenotypes to a wider population. Of the 18 regions involved in the increased prevalence of deletions in persons with neurodevelopmental phenotypes, eight were observed in multiple carriers. The clustering pattern of parental birthplaces of deletion carriers suggested a single origin for each of the eight deletions (Supplementary Material, Fig. S3). Haplotypes constructed using deletion-flanking SNPs show that seven of these eight deletions occur on a single allelic background, implying a single mutation event and a common ancestry for carriers of that deletion (Table 5). For example, a 15q11.2 deletion carried by 13 members of NFBC1966 is contained within a founder haplotype spanning over 800 kb (Fig. 1). Based on the estimated proportion of the genome shared IBD by these 13 carriers (calculated from genome-wide association study data, see Supplementary Material, Fig. S4), 12 were estimated to be related to at least one other carrier through a common ancestor living, at most, five generations ago (proportion IBD > 0.00195).
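The IBD threshold quoted above is numerically equal to 2⁻⁹. One way to arrive at this value, offered here as our reading rather than as the stated derivation, is the expected genome-wide proportion shared IBD by two individuals who each descend, five generations back, from the same ancestral couple (fourth cousins). The sketch below computes that expectation; the function name and the `ancestor_pair` parameter are ours.

```python
def expected_ibd_proportion(generations: int, ancestor_pair: bool = True) -> float:
    """Expected proportion of the genome shared IBD by two relatives.

    `generations` is the number of generations separating each relative
    from the shared ancestor(s). With `ancestor_pair=True` the relatives
    share an ancestral couple; otherwise a single common ancestor.
    """
    if generations < 1:
        raise ValueError("generations must be at least 1")
    meioses = 2 * generations  # path length through one shared ancestor
    per_ancestor = 0.5 ** meioses
    return 2 * per_ancestor if ancestor_pair else per_ancestor


# Sanity checks against familiar relationships.
full_sibs = expected_ibd_proportion(1)      # 0.5
first_cousins = expected_ibd_proportion(2)  # 0.125

# Ancestral couple five generations back (fourth cousins).
fourth_cousins = expected_ibd_proportion(5)
print(fourth_cousins)  # 0.001953125
```

Realized sharing between distant relatives varies widely around this expectation, so a threshold of this kind is best read as an approximate guide to the degree of relatedness.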
In standard genetic studies, the effects of particular genotypic variants are categorized as fully penetrant, partially penetrant or non-penetrant with respect to a pre-specified phenotype. An alternative approach, as described here, is to investigate the degree of association between pre-specified genetic variants and measures that comprise a spectrum of potentially related phenotypes. Searching the phenotype databases of genotyped population cohorts offers a means to characterize such spectra. In NFBC1966, carrier status for large deletions correlates with abnormalities in hearing and with broadly defined deficits in cognitive function, ranging from mental retardation (IQ < 85) to a subtle indicator of impaired cognitive performance (repetition of a school grade), assessed when participants were 14 years of age, and selected as a phenotype through database mining.
The diverse sources of data available for participants of NFBC1966 enabled database mining of additional measures relevant to cognitive function—as well as to hearing—at different developmental time points from those evaluated in our main analysis. The longitudinal comparison between the different data points suggests that the predominant impact of deletion carrier status on cognitive function is relatively subtle, and may not be readily detected early in life. An alternative explanation, however, is that the standardized assessment involved in repetition of a school grade yields a more robust phenotype than the subjective parental reports from which the infant developmental measures were derived. The apparent discordance between the infant and adolescent measures highlights the potential utility of the phenotype mining approach; the search for the specific phenotypic features most strongly associated with pre-selected classes of genotypes provides a means to refine phenotypic definitions for further genetic investigations and may contribute to the genetic dissection of complex phenotypes.
The finding that associations between deletion status and phenotypes assessed from health records attained only nominal significance, which did not survive correction for multiple comparisons, illustrates a possible difficulty in constructing well-powered studies when applying phenotype mining to a wider range of phenotypic measures than we used in our main analysis. One possible solution to this problem is to treat initial phenotype mining analyses as exploratory and to confirm their findings in an independent sample. The NFBC project is ideal for implementing such a strategy, as a later birth cohort from Northern Finland, NFBC1986 (28,29), has an identical study design to NFBC1966, and is potentially available to confirm associations identified in the older cohort.
Our results suggest that individual deletions may be implicated in a wide range of phenotypic variation, and that more comprehensive phenotypic mining could further extend or refine this range. That we did not observe significant correlations between duplications and neurodevelopmental phenotypes is consistent with long established observations indicating that large deletions generate more extreme phenotypic effects than do large duplications. Larger study samples with more extensive phenotypic data may be required for phenotype mining investigations of duplications.
Many of the 18 deletions observed in NFBC1966 members occur in the same chromosomal locations as deletions previously implicated in a wide range of neurodevelopmental phenotypes. Examples include 2p16.3, associated with schizophrenia and autism (30,31); 6p22.3, associated with developmental delay (32–35); 15q13, associated with Prader–Willi syndrome (PWS), Angelman syndrome, epilepsy, ASD and developmental delay (3,7,36,37); 16p11.2, associated with epilepsy, ASD and developmental delay (2,38,39); and 22q11.2, associated with the velocardiofacial syndrome, schizophrenia and developmental delay (40–42).
Several additional deletions lie in regions in which CNVs have previously been implicated in disease susceptibility, but do not match exactly the previously reported coordinates. For example, a 15q11.2 deletion overlaps the PWS region and disrupts four genes: PWRN2, PWRN1, C15orf2 and SNRPN. Of these genes, C15orf2, PWRN1 and SNRPN are monoallelically expressed in fetal brain (43) and their disruption might underlie the neurodevelopmental phenotypes observed among the deletion carriers. Additionally, submicroscopic deletions located on 15q11.2 about 1–2 Mb proximal to the deletion observed in NFBC1966 subjects, and overlapping CYFIP1, have been implicated in developmental delay, schizophrenia and epilepsy (5,7,44). An NFBC1966 deletion on 17q21.3, which includes KIAA1267, partially overlaps with previously reported deletions encompassing CRHR1, IMP5, MAPT, STH and KIAA1267, which have been associated with abnormal neuronal development (45–47). Deletions spanning 10q26.2–q26.3 may play a role in the development of inner ear malformations, vestibular dysfunction and hearing loss (48), with HMX2 and HMX3 suggested as candidate genes for these abnormalities (48,49). The deletion reported here does not overlap with either of these genes, suggesting that the region contains other functional candidate genes that could account for hearing impairment, for example FOXI2, a forkhead transcription factor that regulates embryonic development (50). Finally, the 6q16.1 deletion observed in NFBC1966 overlaps with deletions observed among patients with developmental abnormalities (51,52).
Evaluation of SNP genotype data enabled us to determine that most NFBC1966 individuals carrying the same large deletions inherited them identically from a common ancestor. This observation, together with the geographical clustering of parental birthplaces among deletion carriers, suggests the possibility of utilizing the distinctive population structure of Northern Finland to extend the study samples in which the phenotypic spectrum associated with these genetic variants can be investigated. By constructing extended pedigrees from such shared variants, it may be possible to elucidate genetic or environmental factors that influence phenotype or to identify additional phenotypic manifestations not assessed in the original cohort study. The NFBC and other genotyped Finnish cohorts are linked with numerous public data registries and therefore provide opportunities for exploring the relationship between specified genetic variants and a very wide range of phenotypic features not typically included in genetic investigations.
The study subjects were drawn from an unselected geographically based prospective birth cohort consisting of 96% (n = 12 058) of all live-born children born in the two northernmost provinces of Finland in 1966 (Northern Finland 1966 Birth Cohort, NFBC1966) (9,10). The cohort began with prenatal clinical data collection and has continued with routine follow ups at 0–1 year old, 14 years old and 31 years old, which have generated a phenotype database that includes information from official registers, hospital records, questionnaires and clinical examinations of the participants (53).
In the present study, we utilized a subset of the collected phenotype and clinical data postulated to relate to central nervous system development. From a total of nearly 2500 potential phenotypic measures contained in the NFBC1966 database, we selected the variables to assess in this study based on literature evidence of putative genetic associations with CNVs. We combined information from different components of the database into seven phenotypic categories used in the analysis presented here: IQ < 85, a standardized measure of poor school performance, psychosis, epilepsy, neonatal seizures, cerebral palsy/perinatal brain damage and impaired hearing. Individuals with IQ below 85 were identified based on information drawn from routine follow ups collected at several time points monitoring physical and mental development and national registers. These included (i) a questionnaire filled by the midwife at birth; (ii) a questionnaire administered for all children admitted to a children's hospital during the first 28 days of life; (iii) diagnoses on admission to a children's hospital during 1966–1972; (iv) a questionnaire given to parents when the child was 1 year old, relating to his or her health and development; (v) hospital records and special forms for children who visited neurological outpatient clinics required because of their symptoms or because of the NFBC1966 study; (vi) all existing protocols for IQ tests and psychologist's evaluations from child guidance centers, hospitals and institutions for mentally retarded children; (vii) information from national registers of death certificates, hospital discharge registers and child subsidies for chronically sick and mentally retarded children (16). Information on school grade was collected by questionnaire at the age of 14 years from the participants (17). The Finnish school system is free and compulsory. If a minimum set of nationally standardized criteria is not met, repeating a grade is required (17).
The diagnosis for psychosis was determined according to the DSM-III-R criteria as previously reported (18–20). Individuals with epilepsy or seizures during the period of 1966–2004 were identified from the Finnish Hospital Discharge Register and from the Social Insurance Institution of Finland (22). A diagnosis of childhood epilepsies was collected and reported at 1 and 14 years old. The criteria for childhood epilepsies were met when an individual had at least one non-febrile seizure associated with unconsciousness (23,24). Cerebral palsy and/or perinatal brain damage and neonatal convulsions were defined based on hospitalization and treatment according to the National Hospital Discharge Register, and hospital charts (23–25). An individual was considered to have abnormal hearing if the air-conduction pure tone thresholds exceeded 20 dB at any of the frequencies of 0.25, 0.5, 1, 2, 3, 4, 6, 8 kHz, using testing protocols described previously (26,27).
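The hearing criterion lends itself to a compact illustration. The sketch below applies the stated rule (any air-conduction pure tone threshold above 20 dB at the listed frequencies) to a single audiogram. How the two ears were combined and how missing frequencies were handled are not described above, so treating one audiogram at a time and ignoring untested frequencies are assumptions, as are the function and constant names.

```python
from typing import Mapping

# Test frequencies (kHz) and cut-off (dB) as given in the text.
FREQUENCIES_KHZ = (0.25, 0.5, 1, 2, 3, 4, 6, 8)
CUTOFF_DB = 20


def has_abnormal_hearing(thresholds_db: Mapping[float, float]) -> bool:
    """Return True if any measured threshold exceeds the cut-off.

    `thresholds_db` maps frequency in kHz to the pure tone threshold in dB.
    Frequencies that were not tested are ignored (an assumption).
    """
    return any(
        thresholds_db[freq] > CUTOFF_DB
        for freq in FREQUENCIES_KHZ
        if freq in thresholds_db
    )


# Hypothetical audiograms, for illustration only.
normal = {freq: 10 for freq in FREQUENCIES_KHZ}
high_frequency_loss = {**normal, 6: 35, 8: 40}
```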
Traits measuring early development were collected from parent interview during the first year of life and then from child welfare cards completed by nurses, which were recovered from hospitals for 67% of the total study sample. This information included data on infant fine and gross neuromotor milestone attainment and data on health, parenting and development collected annually until age 6. A total of 44 measures of development were selected for the analysis (Supplementary Material, Tables S3 and S4) (54,55).
DNA samples were collected from peripheral blood during the latest follow-up from a representative subsample of the study participants (53) and genotyping was completed for 5551 individuals using the Illumina Infinium 370cnvDuo chip, which includes 317 000 SNPs and an additional 55 000 probes designed specifically for regions known to vary in copy number. A CNV scan was completed for 4932 of these individuals. The genome-wide CNV scan was performed with the PennCNV software (56) and adjusted for genomic waves according to the genomic GC content (57) and population frequency of B-allele. To enhance the quality of the genotype data, individuals were excluded if their LogR ratio standard deviation was over 0.28, B-allele frequency (BAF) median was outside the range 0.45–0.55, BAF drift was over 0.002 or waviness factor was over 0.04 or below −0.04, according to the software recommendations. CNVs spanning over 500 kb and covered with more than 10 probes were manually confirmed using Illumina BeadStudio Genome-Viewer software (Illumina, San Diego, CA, USA). A subset of 18 loci containing 91 deletions was validated with an independent CNV calling algorithm, QuantiSNP (58), applying thresholds similar to those used for PennCNV; this procedure confirmed 87 of the 91 deletions (96%).
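The sample-level and call-level filters described above can be summarized in a short sketch. It uses the thresholds given in the text, but the record layout, field names and function names are hypothetical; it does not call PennCNV or parse its output.

```python
from dataclasses import dataclass


@dataclass
class SampleQC:
    """Per-sample quality metrics (field names are illustrative)."""
    lrr_sd: float      # LogR ratio standard deviation
    baf_median: float  # B-allele frequency median
    baf_drift: float   # B-allele frequency drift
    waviness: float    # waviness factor


def passes_sample_qc(sample: SampleQC) -> bool:
    """Keep a sample only if every metric lies within the stated limits."""
    return (
        sample.lrr_sd <= 0.28
        and 0.45 <= sample.baf_median <= 0.55
        and sample.baf_drift <= 0.002
        and -0.04 <= sample.waviness <= 0.04
    )


def is_large_cnv(length_bp: int, n_probes: int) -> bool:
    """Select calls spanning over 500 kb and covered by more than 10 probes."""
    return length_bp > 500_000 and n_probes > 10


# Hypothetical samples, for illustration only.
good_sample = SampleQC(lrr_sd=0.20, baf_median=0.50, baf_drift=0.0005, waviness=0.01)
noisy_sample = SampleQC(lrr_sd=0.35, baf_median=0.50, baf_drift=0.0005, waviness=0.01)
```

Calls that pass `is_large_cnv` would then go on to the manual confirmation step, which is not modelled here.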
The phenotypic variables were examined to determine which test to use based on the nature of the data, and whether categories needed to be combined in order to have a sufficient number of samples per cell. For continuous early life variables, the mean time in months to achieve each developmental milestone was compared between individuals with and without a deletion using t-tests. Binary developmental milestones and the clinical and intellectual phenotypes were analyzed using Fisher's exact test. The analyses were performed conditional on having a CNV. The correlation between the different phenotypes was estimated by Pearson correlation.
The haplotypes were phased and defined separately for the distal and proximal sides of a deletion using Beagle 3.1 software (59) and the genotype information of the complete study sample. Prior to phasing, extended allelic sharing of deletion carriers was used to specify a tentative recombination site for each haplotype. A recombination was considered to have occurred when homozygous genotypes for both alleles of a given SNP were observed among the carriers. The tentative break points were then used to restrict the region to be phased. The kinship among individuals from the NFBC1966 was estimated from the genome-wide sharing of SNP genotypes using PLINK software (60).
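The rule for placing a tentative recombination site can be sketched as follows. Two carriers who are homozygous for different alleles at a SNP cannot share a haplotype there, so the first such SNP encountered when moving away from the deletion bounds the shared segment. The genotype encoding ("AA", "AB", "BB"), the input layout and the function name are assumptions made for illustration; this is not the authors' code and does not involve Beagle.

```python
from typing import List, Optional, Sequence


def first_opposite_homozygote(genotypes_by_snp: Sequence[Sequence[str]]) -> Optional[int]:
    """Locate the tentative recombination site on one side of a deletion.

    `genotypes_by_snp` is ordered outward from the deletion boundary; each
    element lists the carriers' genotypes at one SNP. Returns the index of
    the first SNP at which both homozygous genotypes are observed, or None
    if sharing is never contradicted within the supplied SNPs.
    """
    for index, calls in enumerate(genotypes_by_snp):
        observed = set(calls)
        if "AA" in observed and "BB" in observed:
            return index
    return None


# Hypothetical genotypes for three carriers at four flanking SNPs.
flank: List[List[str]] = [
    ["AA", "AB", "AA"],  # compatible with a shared A haplotype
    ["BB", "BB", "AB"],  # compatible with a shared B haplotype
    ["AB", "AB", "AB"],  # uninformative, still compatible
    ["AA", "BB", "AB"],  # opposite homozygotes: sharing ends here
]
break_index = first_opposite_homozygote(flank)
```

Running the function separately on the distal and proximal flanks gives the two tentative break points that delimit the region to be phased.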
This work was supported by The Academy of Finland (project grants 104781, 120315, 132797, and Center of Excellence in Complex Disease Genetics); University Hospital Oulu, Biocenter, University of Oulu, Finland; the European Community's Fifth/Seventh Framework Programme (EURO-BLCS, QLG1-CT-2000-01643, FP7/2007-2013); The National Heart Lung and Blood Institute (grant number 5R01HL087679-02), The National Institute of Mental Health (grant number 1RL1MH083268-01); the ENGAGE project (HEALTH-F4-2007-201413); The Wellcome Trust (grant numbers WT089061, WT089062); The National Institute of Neurological Diseases and Stroke (grant numbers PL1NS062410 and P30NS062691); The Medical Research Council (G0500539, PrevMetSyn/MRC); Sigrid Juselius Foundation; The Stanley Medical Research Center; The National Alliance for Research in Schizophrenia and Depression; the Biomedicum Helsinki Foundation; and the Jalmari and Rauha Ahokas Foundation.
The DNA extractions, sample quality controls, biobank up-keeping and aliquotting were performed in the National Public Health Institute, Biomedicum Helsinki, Finland with support from the Academy of Finland and Biocentrum Helsinki. In addition, the authors would like to acknowledge the study participants of the Northern Finland 1966 Birth Cohort. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.
Conflict of Interest statement. None declared.