Heterogeneity in phenotypic presentation of ASD has been cited as one explanation for the difficulty in pinpointing specific genes involved in autism. Recent studies have attempted to reduce the “noise” in genetic and other biological data by reducing the phenotypic heterogeneity of the sample population. The current study employs multiple clustering algorithms on 123 item scores from the Autism Diagnostic Interview-Revised (ADI-R) diagnostic instrument of nearly 2000 autistic individuals to identify subgroups of autistic probands with clinically relevant behavioral phenotypes in order to isolate more homogeneous groups of subjects for gene expression analyses. Our combined cluster analyses suggest optimal division of the autistic probands into 4 phenotypic clusters based on similarity of symptom severity across the 123 selected item scores. One cluster is characterized by severe language deficits, while another exhibits milder symptoms across the domains. A third group possesses a higher frequency of savant skills, while the fourth group exhibits intermediate severity across all domains. Grouping autistic individuals by multivariate cluster analysis of ADI-R scores reveals meaningful phenotypes of subgroups within the autistic spectrum which we show, in a related (accompanying) study, to be associated with distinct gene expression profiles.
Autism spectrum disorders (ASD) are developmental disabilities resulting from dysfunction in the central nervous system and are characterized by impairments in three behavioral areas: communication (notably spoken language), social interactions, and repetitive behaviors or restricted interests (Volkmar et al., 1994). ASD usually manifest before three years of age and the severity can vary greatly. Idiopathic ASD include autism, which is considered to be the most severe form, pervasive developmental disorders not otherwise specified (PDD-NOS), and Asperger’s syndrome, a milder form of autism in which persons can have relatively normal intelligence and communication skills but still experience great difficulty with social interactions. ASD with defined genetic etiologies or chromosomal aberrations include Rett’s syndrome, tuberous sclerosis, Fragile X syndrome, and chromosome 15 duplication (reviewed in Muhle, Trentacoste, & Rapin, 2004). Familial studies provide evidence that individuals closely related to an autistic individual (i.e., mother, father, and siblings) may have “autistic tendencies” but do not meet the criteria for ASD, suggesting that a broad autism phenotype (BAP) may also exist (Piven, Palmer, Jacobi, Childress, & Arndt, 1997).
Previous studies establish a strong genetic component for the etiology of autism, and many loci have been proposed as autism susceptibility regions, including loci on chromosomes 1, 2, 7, 11, 13, 15, 16, and 17 (reviewed in Gupta & State, 2007; Polleux & Lauder, 2004; Santangelo & Tsatsanis, 2005; Yonan et al., 2003). However, the specific genes involved within each locus have not been determined to date. Available data further suggest that multiple gene interactions, epigenetic factors, and environmental risk factors may also be at the core of autism etiology (del Gaudio et al., 2006; Geschwind, 2008; Herbert et al., 2006; Jiang et al., 2004; Lathe, 2006; Varki, Geschwind, & Eichler, 2008).
Heterogeneity in phenotypic presentation of ASD has been offered as one explanation for the difficulty in pinpointing chromosomal loci and genes involved in autism. Thus, recent studies have attempted to reduce the “noise” in genetic data by reducing the phenotypic heterogeneity of the sample population using a variety of approaches. Some of the earlier studies stratified samples for genetic analyses primarily on language deficits of the proband (e.g., age at first word, phrase speech delay), while other studies focused on other attributes of autistic disorder, such as compulsions, or Restricted and Repetitive Stereotyped Behaviors (RRSB) to restrict phenotypic heterogeneity (Alarcon, Cantor, Liu, Gilliam, & Geschwind, 2002; Bradford et al., 2001; Hollander et al., 2000; Silverman et al., 2001). Another strategy for increasing the probability of observing genetic linkage was based upon the use of “endophenotypes” for specific autism-associated behaviors which were present in nonaffected family members (Spence et al., 2006). Using this approach, Alarcon et al. and Chen et al. reported quantitative trait loci (QTL) for language and nonverbal communication deficits, respectively (Alarcon, Yonan, Gilliam, Cantor, & Geschwind, 2005; Chen, Kono, Geschwind, & Cantor, 2006).
The Autism Diagnostic Interview-Revised (ADI-R) is a comprehensive assessment instrument for ASD. It is a clinician-administered interview that probes for language, social, behavioral, and functional abnormalities that are inconsistent with a specific child’s stage of development (Lord, Rutter, & Couteur, 1994). Principal components analysis (PCA) of 98 items from the ADI-R has also been used as a means to isolate genetically relevant phenotypes (Tadevosyan-Leyfer et al., 2003). This study identified 6 “factors” which accounted for 41% of the variation in the autistic population studied. Reexamination of genetic data from individuals defined by presence or absence of “savant skills” (one of the factors) and from their respective family members showed an increase in LOD score (0.4 → 2.6) in the chromosome 15q11-q13 region relative to the combined unsegregated sample population (Nurmi et al., 2003). However, this finding could not be replicated by another group (Ma et al., 2005). Recent analyses of the use of the ADI-R to increase phenotypic homogeneity summarize the major studies which have attempted to stratify autism samples and further caution that such stratification based upon just a few defined attributes can also lead to unintended associations with other variables, such as age, gender, and race (Hus, Pickles, Cook Jr., Risi, & Lord, 2007; Lecavalier et al., 2006).
In this paper, we demonstrate the use of multiple clustering methods applied to a broad range of ADI-R items from a large population (1954 individuals) to identify subgroups of autistic individuals with clinically relevant behavioral phenotypes. We further select individual male samples based on these cluster methods for gene expression analyses, demonstrating that the selected samples are indeed representative of the clusters identified within the broader autistic population, and cover a broad range in terms of age and symptom severity of ASD. In the accompanying manuscript, we show that the selected lymphoblastoid cell lines derived from individuals who fall within 3 of the phenotypic subgroups show distinct differences in gene expression profiles that in part relate to the severity of the phenotype. Functional and pathway analyses of the gene expression data also suggest distinct differences in the biological phenotypes that associate with these subgroups.
ADI-R score sheets were downloaded for 1954 individuals with autism from the Autism Genetic Research Exchange (AGRE) phenotype database. The gender and age profile of the individuals whose score sheets were used was as follows: 1526 males [age range: 1.85 – 47.68 yrs; mean age: 8.3 yrs; median age: 7.2 yrs]; 428 females [age range: 2.04 – 44.63 yrs; mean age: 8.15 yrs; median age: 7.12 yrs]. A total of 123 items that were identical or comparable on both the 1995 and 2003 versions of the ADI-R were included. Following the example of Tadevosyan-Leyfer et al. (2003), “current” and “ever” scores were used for most of these items to provide some redundancy in the data and increase the robustness of the symptomatic profile of each individual. Only items scored numerically (0 = normal; 3 = most severe) were incorporated into our analyses. A score of 8 for items in the spoken language subgroup indicated that the items were not applicable because of insufficient language and was replaced with a rating of 3. Scores of 8 or 9 for other items (excluding those from the spoken language subgroup), which indicated the item was not asked or not applicable, were replaced with blanks to reflect that no information was available for that item. A score of 1 or 2 on item 19 (LEVELL) indicated an overall language deficit and, as a result, items 20-28 were assigned a score of 3 to reflect impaired language skills, as previously done by others (Tadevosyan-Leyfer et al., 2003). Scores of 4 on the savant skills items, which meant that the individual possessed an isolated though meaningful skill/knowledge above that of his general functional level or the population norm, were replaced with 3 to maintain consistency of the 0-3 scale across all items. Scores of 7 for some items were changed to a score between 0 and 3 depending on the nature of the question and how it reflected severity with respect to that specific item. 
A score of -1 indicated missing data (according to AGRE) and was replaced with a blank. It should be noted that the missing scores were random among clusters and did not appear to be an obvious factor in the cluster analyses. Supplementary Table 1 summarizes the score modifications for each item used in our cluster analyses of autistic individuals.
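The recoding rules described above can be summarized as a short Python sketch. This is only an illustration under stated assumptions: the item names, the contents of `SPOKEN_LANGUAGE_ITEMS` and `SAVANT_ITEMS`, and the function name `recode_scores` are hypothetical placeholders (the actual item list and adjustments are those in Supplementary Table 1), and the item-specific handling of scores of 7 is deliberately left out because it depends on each question.

```python
from typing import Dict, Optional

# Hypothetical item names; the real item list is given in Supplementary Table 1.
SPOKEN_LANGUAGE_ITEMS = {f"item{n}" for n in range(20, 29)}  # items 20-28
SAVANT_ITEMS = {"savant_memory", "savant_music"}             # placeholder names


def recode_scores(raw: Dict[str, int]) -> Dict[str, Optional[int]]:
    """Map raw ADI-R codes onto the 0-3 severity scale; None stands for a blank."""
    recoded: Dict[str, Optional[int]] = {}
    for item, score in raw.items():
        if score == -1:
            recoded[item] = None   # missing data according to AGRE
        elif item in SPOKEN_LANGUAGE_ITEMS and score == 8:
            recoded[item] = 3      # not applicable because of insufficient language
        elif score in (8, 9):
            recoded[item] = None   # not asked / not applicable
        elif item in SAVANT_ITEMS and score == 4:
            recoded[item] = 3      # keep savant items on the 0-3 scale
        else:
            recoded[item] = score  # scores of 7 are item-specific; not handled here
    # An overall language deficit on item 19 (LEVELL) sets items 20-28 to 3.
    if raw.get("LEVELL") in (1, 2):
        for item in SPOKEN_LANGUAGE_ITEMS:
            if item in recoded:
                recoded[item] = 3
    return recoded
```

For example, `recode_scores({"LEVELL": 2, "item20": 0})` assigns a score of 3 to `item20`, mirroring the LEVELL rule.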
Data from ADI-R score sheets for 1954 individuals were loaded into MeV (Saeed et al., 2003), a software program created by John Quackenbush and colleagues to analyze microarray gene expression data. Each individual is represented by a horizontal row in the data matrix while ADI-R items are represented by vertical columns. Multiple clustering analyses were employed to subgroup individuals on the basis of similarity of ADI-R item scores, and included principal components analysis (PCA), hierarchical clustering (HCL), and k-means clustering (KMC), which is “supervised” only in the sense that the number of clusters (K) must be specified in advance. A figure of merit (FOM) analysis (Yeung, Haynor, & Ruzzo, 2001) was also conducted to estimate the optimal number of clusters, while correspondence analysis (COA) was used to visualize the association of specific items with the different clusters of individuals. A description of each of these analytical methods is given by Saeed et al. (2003).
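To make the k-means step concrete, the following is a minimal pure-Python sketch of clustering individuals (rows) by their item scores (columns) for a prespecified K. It is not the MeV implementation: the distance measure (mean squared difference over non-blank items), the treatment of blanks, the deterministic farthest-first seeding, and the names `distance`, `column_means`, and `kmeans` are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence

Row = Sequence[Optional[float]]


def distance(row: Row, centroid: Sequence[float]) -> float:
    """Mean squared difference over the items that are not blank in `row`."""
    pairs = [(x, c) for x, c in zip(row, centroid) if x is not None]
    if not pairs:
        return 0.0
    return sum((x - c) ** 2 for x, c in pairs) / len(pairs)


def column_means(rows: List[Row], n_items: int, fallback: Sequence[float]) -> List[float]:
    """Per-item mean over non-blank scores; use `fallback` where nothing is observed."""
    means = []
    for j in range(n_items):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed) if observed else fallback[j])
    return means


def kmeans(data: List[Row], k: int, max_iter: int = 100) -> List[int]:
    """Return one cluster label (0..k-1) per individual."""
    n_items = len(data[0])
    overall = column_means(data, n_items, [0.0] * n_items)
    # Farthest-first seeding (an assumption): start from the first row, then
    # repeatedly add the row farthest from all centroids chosen so far.
    centroids = [column_means([data[0]], n_items, overall)]
    while len(centroids) < k:
        far = max(data, key=lambda r: min(distance(r, c) for c in centroids))
        centroids.append(column_means([far], n_items, overall))
    labels = [-1] * len(data)
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda i: distance(r, centroids[i])) for r in data]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for i in range(k):
            members = [r for r, lab in zip(data, labels) if lab == i]
            if members:
                centroids[i] = column_means(members, n_items, centroids[i])
    return labels


# Toy data: six individuals, four items scored 0-3, None marking a blank.
toy = [[0, 0, 1, 0], [0, 1, 0, 0], [3, 3, 3, None],
       [3, 2, 3, 3], [0, 0, 0, None], [3, 3, 2, 3]]
toy_labels = kmeans(toy, k=2)
```

On the toy data, the low-severity rows and the high-severity rows end up in separate clusters; on real data the result would depend on the seeding and distance choices noted above.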
Lymphoblastoid cell lines (LCL) for DNA microarray analyses were selected on the basis of phenotypic clustering of autistic individuals using the methods described above. As described in the results, the application of multiple clustering algorithms to the selected ADI-R items from scoresheets of 1954 individuals resulted in 4 reasonably distinct phenotypic subgroups. Samples were selected from 3 of the 4 groups for gene expression analyses. These groups included those with severe language impairment, those with milder symptoms across all domains, and those defined by presence of notable savant skills. The intermediate group was not included because we first wished to test the concept that the extreme phenotypes of ASD (severe and mild) could be distinguished by gene expression profiling. The savant phenotype was included for gene expression analyses not only because savant skills are of general interest, but also because they were a dominant feature of the third principal component in the PCA analysis of probands (data not shown). Because we wanted to reduce the heterogeneity of subjects for our gene expression studies on idiopathic autism, we chose to exclude all probands whose autism could be attributed to a known genetic cause (Fragile X, chromosome 15 duplication, Rett’s Syndrome) and to avoid confounding factors due to diagnosed comorbid conditions (OCD, bipolar, etc.) or prematurity. In a previous and separate study on autistic-nonautistic male siblings (manuscript submitted), we observed differential expression of genes involved in steroid hormone biosynthesis (particularly androgens) and, wishing to avoid the complication of hormonal (gender) effects on gene expression, excluded females from the gene expression studies. Clearly, females with ASD need to be studied as well. 
Interestingly, separate cluster analyses of the ADI-R scores of male and female subjects gave very similar results, suggesting that females exhibit many of the same behavioral/functional phenotypes as males. In addition, a score < 80 on the Peabody Picture Vocabulary Test (PPVT) was used to confirm language deficits for those in the group identified by cluster analysis as having severe language impairment. For the accompanying gene expression study, 26-31 cell lines were obtained for each study group, along with 29 cell lines from “control” individuals who were nonautistic siblings of individuals with autism, matched roughly in age to the autistic probands, the majority of whom were unrelated to the controls. In this study, we also applied cluster analyses to the ADI-R scores of the ASD individuals whose LCL were selected for gene expression analysis and demonstrate that applying our exclusion criteria did not change the cluster assignment for these samples. Supplementary Table 2 provides a demographic profile of the subjects selected for gene expression analyses which includes pedigree, age, race, and ethnicity as well as standard PPVT and Raven’s scores.
To reduce the phenotypic heterogeneity of autism for gene expression analyses, we applied several different clustering methods to the scores from ADI-R questionnaires (from the AGRE database) describing 1954 autistic individuals. For these analyses, we selected 123 item scores that covered a broad spectrum of behaviors and functions in order to identify phenotypic subgroups of individuals with idiopathic ASD who were characterized by combined symptoms across multiple domains. These domains included language, nonverbal communication, social interactions, play skills, interests and behaviors, physical sensitivities and mannerisms, aggression, and savant skills. The specific items and score adjustments are shown in Supplementary Table 1.
Principal components analysis of the subjects based on their ADI-R scores shows separation of the autistic individuals into 2 main clusters, but does not clarify the phenotypic nature of each group of subjects (Fig. 1A). Hierarchical clustering (HCL) was therefore performed to obtain a broader sense of the structure of the ASD population as revealed by their respective scores on the selected ADI-R items. This analysis clearly shows separation of the individuals into more than 2 clusters, based upon symptomatic profile across the different items (Fig. 1B). A Figure of Merit (FOM) analysis, which was employed to estimate the optimal number of clusters for supervised clustering analysis (Fig. 1C), suggested 3-5 clusters. We then performed K-means clustering of the subjects using each of these K-values (3-5), and concluded that 4 clusters gave optimal separation of recognizable phenotypes (Fig. 2A). For example, one group is characterized by severe language deficits (samples within this group were assigned the color Red for ease of individual identification), while another group (Blue) exhibits milder symptoms across the domains, as indicated by more black in the matrix, reflecting ADI-R severity scores of 0 (normal). A third group (Yellow) possesses noticeable savant skills, which are represented by the last 12 columns on the right of the score matrices, while the fourth group (Green) exhibits intermediate severity across the domains, but with relatively lower frequency of savant skills. When the subgroup color coding from the KMC analyses was applied to the graph obtained by principal components analysis (Fig. 1A), a clear, though not perfect, separation among the groups is observed (Fig. 2B). It is worth noting that the first 3 components of the PCA capture 38% of the variation among the samples (with 42% represented within the first 4 components). 
These results indicate that there is a large amount of variability in the ADI-R data, as is evident from the many small branches in the hierarchical cluster, which collectively contribute to the residual variance beyond that accounted for by the first 3-4 principal components. Yet, the separation of the ASD population into severely language impaired, intermediate, mild, and highly savant phenotypes is quite clear. A correspondence analysis (COA) of the data further suggests that specific clusters of items (e.g., savant skills, aggression, or ritualistic behaviors/resistance to change) are more strongly associated, on the basis of higher severity scores, with individuals in certain subgroups than in others (Fig. 3 and Table 1). For example, savant skills (turquoise squares) associate more with the “savant” (yellow) and mild (blue) individuals, while severe deficits in spoken language, nonverbal communication, and social skills (pink squares) are concentrated in the group with severe language impairment (red). Not surprisingly, we also see association of circumscribed interests and unusual preoccupations (lavender squares) with the mild (blue) and “savant” (yellow) groups, but these behavioral traits are also clustered with compulsions and ritualistic behaviors (Table 1). Interestingly, aggression and physiological symptoms (lime-colored squares) are associated with individuals exhibiting the mild (blue) phenotype of ASD.
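The figure of merit used to choose the range of K can be sketched as follows. The idea, following the general leave-one-out scheme of Yeung et al. (2001), is to leave out one column (here, one ADI-R item) at a time, cluster on the remaining columns, and measure how tightly the left-out scores group within each cluster; lower values indicate better predictive clustering. The sketch below is an illustration only, not the MeV implementation: it assumes a complete matrix with no blanks, omits any adjustment for the number of clusters, and uses a deliberately simple stand-in clusterer (`split_by_mean_score`, a hypothetical helper) so that the block stays self-contained.

```python
import math
from typing import Callable, List

Matrix = List[List[float]]
Clusterer = Callable[[Matrix, int], List[int]]


def figure_of_merit(data: Matrix, k: int, cluster_fn: Clusterer) -> float:
    """Aggregate leave-one-column-out figure of merit for k clusters."""
    n_rows, n_cols = len(data), len(data[0])
    total = 0.0
    for left_out in range(n_cols):
        # Cluster the individuals using every item except the left-out one.
        reduced = [row[:left_out] + row[left_out + 1:] for row in data]
        labels = cluster_fn(reduced, k)
        # Within-cluster squared deviation of the left-out item's scores.
        sq_err = 0.0
        for label in set(labels):
            values = [data[i][left_out] for i in range(n_rows) if labels[i] == label]
            mean = sum(values) / len(values)
            sq_err += sum((v - mean) ** 2 for v in values)
        total += math.sqrt(sq_err / n_rows)
    return total


def split_by_mean_score(data: Matrix, k: int) -> List[int]:
    """Stand-in clusterer (assumption): rank rows by mean score, cut into k equal bins."""
    order = sorted(range(len(data)), key=lambda i: sum(data[i]) / len(data[i]))
    labels = [0] * len(data)
    for position, i in enumerate(order):
        labels[i] = position * k // len(data)
    return labels


# Toy matrix: four individuals, three items scored 0-3.
toy = [[0, 0, 0], [0, 1, 0], [3, 3, 3], [3, 2, 3]]
fom_k1 = figure_of_merit(toy, 1, split_by_mean_score)
fom_k2 = figure_of_merit(toy, 2, split_by_mean_score)
```

On this toy matrix the value for K = 2 is lower than for K = 1, reflecting the two obvious severity groups. In practice one would pass a real clustering routine as `cluster_fn` and compare the values across candidate K.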
Based upon these combined clustering methods, we selected LCL from individuals represented in 3 of the 4 phenotypic groups for gene expression analyses. These groups included those with severe language impairment, those with a milder phenotype (~40% of whom had clinical diagnoses of Asperger’s Syndrome or PDD-NOS), and those with notable savant skills. Because of the relatively low number of individuals in the “savant” category once other exclusion criteria were applied, we selected a few samples from the group with severe language impairment who also exhibited high scores on a majority of savant skills. It should be pointed out that those with savant skills were a minor fraction of the group with severe language impairment. Principal components and K-means cluster analyses of the ADI-R item scores for the individuals selected for the microarray studies confirm the separation of the selected samples into 4 phenotypic groups as described in the figure legend (Figs. 4A and 4B), with the fourth phenotypic group representing individuals with severe language deficits and savant skills (depicted by orange color in Fig. 4A).
Figure 5 shows the sum of ADI-R scores across all of the items used in this study for the selected individuals, as well as the sum of item scores specific for different functional domains. As shown by the inset within several of the graphs, the score profile of the group selected for gene expression analysis typically mirrors that of the 1954 individuals from the repository (inset), suggesting that the selected individuals were phenotypically representative of the general autistic population. The profiles for other functional domains (e.g., nonverbal communication, play skills, restricted interests and behaviors) are similar to that representing the sum of all items, for all the individuals in the repository as well as the ones selected for microarray analyses. The average of item scores for each group across the items in each domain, as well as the group averages of combined ADI-R scores across all items, also confirm the phenotypic distinction among the groups (Figs. 6 and 7). Although there is no significant difference between the average of the sums of the ADI-R scores for the mild (blue) and savant (yellow) groups, the ADI-R score profiles in Fig. 4B as well as in Fig. 7 show that there are indeed quantitative differences in severity among the phenotypic groups across multiple functional/behavioral domains, with the savant group showing lower severity scores than the mild group for almost all items except for savant skills. It is also interesting to note that while individuals in the mild group (blue) exhibit lower severity scores in the language domain, most of their scores in the social, nonverbal, and play categories are nearly as severe as those for individuals with severe language impairment (red), suggesting that higher language abilities do not necessarily correlate well with improved social skills (Fig. 4B and Fig. 7).
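The summary statistics described above (per-individual sums of item scores within each domain, and group averages of those sums) amount to simple arithmetic, sketched here in Python. The domain definitions, item names, individuals, and group labels are invented for illustration, and skipping blank items when summing is an assumption, since the text does not state how blanks were treated in the sums.

```python
from collections import defaultdict
from typing import Dict, List, Optional

Scores = Dict[str, Optional[int]]


def domain_sums(scores: Scores, domains: Dict[str, List[str]]) -> Dict[str, int]:
    """Sum the non-blank item scores within each domain for one individual."""
    return {
        domain: sum(scores[item] for item in items if scores.get(item) is not None)
        for domain, items in domains.items()
    }


def group_means(sums: Dict[str, Dict[str, int]],
                group_of: Dict[str, str]) -> Dict[str, Dict[str, float]]:
    """Average the per-individual domain sums within each phenotypic group."""
    collected = defaultdict(lambda: defaultdict(list))
    for person, per_domain in sums.items():
        for domain, value in per_domain.items():
            collected[group_of[person]][domain].append(value)
    return {
        group: {domain: sum(vals) / len(vals) for domain, vals in by_domain.items()}
        for group, by_domain in collected.items()
    }


# Hypothetical toy data: two domains, three individuals, two groups.
domains = {"language": ["l1", "l2"], "savant": ["s1"]}
people = {
    "p1": {"l1": 3, "l2": 3, "s1": 0},
    "p2": {"l1": 3, "l2": None, "s1": 1},  # None marks a blank item
    "p3": {"l1": 0, "l2": 1, "s1": 3},
}
group_of = {"p1": "red", "p2": "red", "p3": "yellow"}

sums = {person: domain_sums(scores, domains) for person, scores in people.items()}
means = group_means(sums, group_of)
```

With these toy values, the "red" group averages 4.5 on the language domain and 0.5 on the savant domain, while the single "yellow" individual scores 1 and 3, respectively.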
The primary goal of this study was to develop a method of directly clustering autistic probands according to similarity of severity scores across a broad range of behavioral and functional symptoms probed by the ADI-R in order to reduce the heterogeneity of samples for biological (specifically, gene expression) analyses. In this respect, our study differs from many other studies which have attempted to analyze the factor structure of the ADI-R (Constantino et al., 2004; Georgiades et al., 2007; Tadevosyan-Leyfer et al., 2003; Van Lang et al., 2006). These studies are comprehensively discussed in a recent study by Snow et al., who performed factor analyses on a majority of ADI-R items from scoresheets of both verbal and non-verbal autistic probands and concluded that autistic symptomatology can be best described by a two-domain model (Snow, Lecavalier, & Houts, 2008). The items comprising the larger of the two domains (Factor I) correspond to the items associated with the “pink” cluster in Fig. 3 (this study), which include all of the spoken language and nonverbal/social communication items (Table 1). The items comprising the second domain (Factor II) correspond roughly to the items identified by the “lavender” cluster in Fig. 3 and include restricted/repetitive behaviors, circumscribed interests, unusual preoccupations, and stereotypies (Table 1). Interestingly, the correspondence analysis (COA), which associates clusters of items with groups of individuals on the basis of severity scores, shows an association of the latter set of items (Factor II) with the mild and savant groups of ASD. In addition to identifying item clusters that correlate with Factors I and II in the Snow study, our COA also identified 2 additional clusters of items that appear to separate distinctly from those of the other 2 clusters (Fig. 3). One of these item clusters involves the savant skills items (turquoise colored squares in Fig. 3) which were not included by Snow et al. 
(Snow, Lecavalier, & Houts, 2008), but which were identified as a factor in the ADI-R by Tadevosyan-Leyfer et al. (2003). A subsequent genetic linkage analysis by Nurmi et al. (2003) based upon this factor demonstrated the value of subdividing the autistic probands and their families according to the savant phenotype. Not surprisingly, the items in this cluster associate most strongly with the savant and mild ASD individuals in our study. The fourth cluster of items in Fig. 3 (lime colored squares), which has not been extensively explored in the context of autism, includes items related to aggression and self-injury, and associates predominantly with a minority of individuals exhibiting the mild ASD phenotype. It will be of further interest to explore the co-expression and significance of these behavioral traits in specific phenotypes of ASD in future studies.
Another recent study by Rapin et al. argues that there are 2 major subtypes of language disorder in autistic children, differentiated mainly by impaired expressive phonology, with each subtype subdivided by comprehension ability (Rapin, Dunn, Allen, Stevens, & Fein, 2009). In this respect, the group with low phonology may be comparable to the group that we identify as “severely language impaired”, although there is no direct comparison of items analyzed. Indeed, we show in the accompanying article on gene expression analyses of several subtypes of ASD defined by this study, that the subgroup with severe language impairment exhibits the most differentially expressed genes relative to nonautistic controls and is the only subtype with significant dysregulation of circadian rhythm genes.
In contrast to the many studies which have sought to identify discrete phenotypes of autism which can be used to reduce heterogeneity for biological studies, Ring et al. have recently proposed a continuous gradient model in which the differences between autistic individuals are more quantitative than qualitative (Ring, Woodbury-Smith, Watson, Wheelwright, & Baron-Cohen, 2008). The study of Constantino et al., which shows a continuum in terms of a range of deficits based on scores on the Social Responsiveness Scale and ADI-R (Constantino et al., 2004), may also be used in support of this gradient concept. Our gene expression analyses of several of the ASD phenotypes identified here through cluster analyses of autistic probands based on their ADI-R scores essentially offer support for both the discrete phenotype and the gradient models by identifying sets of genes that are differentially expressed either quantitatively or qualitatively among ASD subgroups relative to controls. These results respectively provide evidence for genes (common to 2 or more groups) responsible for core deficits across the spectrum that differ mainly in symptom severity as well as for genes (unique to a given subgroup) which implicate the involvement of different metabolic and/or signaling pathways among the phenotypes.
With respect to diagnosis of autism, the ADI-R is one of the most widely used and comprehensive diagnostic instruments for autism (Lord et al., 1994) and, to many, represents the “gold standard” for identifying individuals with ASD. However, it is only administered after a child presents with abnormal development (e.g., delayed speech) or aberrant behaviors, which typically is noticed between the ages of 2 and 3. Although many studies are currently attempting to identify even earlier signs of abnormal social development (e.g., lack of eye contact, pointing, or shared attention in toddlers (Landa, Holman, & Garrett-Mayer, 2007) ), there is still a need to identify definitive molecular markers of ASD that may be used to screen for autism even earlier (pre- or postnatally) as well as to provide targets for therapeutic intervention. We have therefore embarked upon a series of studies to identify expressed biomarkers of ASD through the use of large-scale gene expression analyses. Because ADI-R scores are the most widely available phenotypic data for the majority of autistic children (particularly within the AGRE repository), we sought to use the information in this test instrument as a starting point to subdivide diagnosed individuals for genomics analyses. We demonstrate in the accompanying manuscript that subgrouping of autistic individuals by multivariate cluster analysis of ADI-R scores which captures the breadth of the disorder within each individual reveals meaningful subgroups or phenotypes of idiopathic autism that can be separated from controls as well as distinguished from each other by gene expression profiling. Detailed bioinformatics analyses of the differentially expressed genes from the resulting subgroups reveal similarities as well as differences in pathways and functions associated with the different ASD phenotypes. 
Based on these combined and complementary analyses, we suggest that multivariate analysis of the ADI-R data using a broad spectrum of the ADI-R items and a combination of clustering methods that are typically employed in DNA microarray analyses may be an effective means of reducing the phenotypic heterogeneity of the sample population without restricting the phenotype to only one or a few items which, as pointed out by others (Hus et al., 2007; Lecavalier et al., 2006), may associate coincidentally with other variables. Such an approach towards stratification of individuals, which utilizes the full spectrum of autism-associated behaviors and can be easily tailored to include additional relevant scored items, is expected to aid in the association of genetic and other biological factors with specific forms of idiopathic autism. Finally, we suggest that similar cluster analyses of scored behavioral and functional evaluations may also be useful in reducing the heterogeneity of other complex, heterogeneous psychiatric disorders for genetic and other biological analyses.
Grant sponsors: National Institute of Mental Health, NIH, Grant # R21 MH073393 (VWH); Autism Speaks, Grant # 2381 (VWH)
We gratefully acknowledge the resources provided by the Autism Genetic Resource Exchange (AGRE) Consortium* and the participating AGRE families. The Autism Genetic Resource Exchange is a program of Autism Speaks and is supported, in part, by grant 1U24MH081810 from the National Institute of Mental Health to Clara M. Lajonchere (PI).
We especially thank Dr. Vlad Kustanovich of AGRE for providing us with additional information about the samples which were not easily retrievable in the database.
*The AGRE Consortium:
Dan Geschwind, M.D., Ph.D., UCLA, Los Angeles, CA;
Maja Bucan, Ph.D., University of Pennsylvania, Philadelphia, PA;
W. Ted Brown, M.D., Ph.D., F.A.C.M.G., N.Y.S. Institute for Basic Research in Developmental Disabilities, Long Island, NY;
Rita M. Cantor, Ph.D., UCLA School of Medicine, Los Angeles, CA;
John N. Constantino, M.D., Washington University School of Medicine, St. Louis, MO;
T. Conrad Gilliam, Ph.D., University of Chicago, Chicago, IL;
Martha Herbert, M.D., Ph.D., Harvard Medical School, Boston, MA;
Clara Lajonchere, Ph.D, Cure Autism Now, Los Angeles, CA;
David H. Ledbetter, Ph.D., Emory University, Atlanta, GA;
Christa Lese-Martin, Ph.D., Emory University, Atlanta, GA;
Janet Miller, J.D., Ph.D., Cure Autism Now, Los Angeles, CA;
Stanley F. Nelson, M.D., UCLA School of Medicine, Los Angeles, CA;
Gerard D. Schellenberg, Ph.D., University of Washington, Seattle, WA;
Carol A. Samango-Sprouse, Ed.D., George Washington University, Washington, D.C.;
Sarah Spence, M.D., Ph.D., UCLA, Los Angeles, CA;
Matthew State, M.D., Ph.D., Yale University, New Haven, CT;
Rudolph E. Tanzi, Ph.D., Massachusetts General Hospital, Boston, MA.