Autism spectrum disorder (ASD) is a heterogeneous neurodevelopmental disorder with a prevalence of 0.9–2.6%. Twin studies showed a heritability of 38–90%, indicating strong genetic contributions. Yet it is unclear how many genes have been associated with ASD and how strong the evidence is. A comprehensive review and analysis of literature and data may bring a clearer big picture of autism genetics. We show that as many as 2193 genes, 2806 SNPs/VNTRs, 4544 copy number variations (CNVs) and 158 linkage regions have been associated with ASD by GWAS, genome-wide CNV studies, linkage analyses, low-scale genetic association studies, expression profiling and other low-scale experimental studies. To evaluate the evidence, we collected metadata about each study including clinical and demographic features, experimental design and statistical significance, and used a scoring and ranking approach to select a core data set of 434 high-confidence genes. The genes mapped to pathways including neuroactive ligand–receptor interaction, synapse transmission and axon guidance. To better understand the genes we parsed over 30 databases to retrieve extensive data about expression patterns, protein interactions, animal models and pharmacogenetics. We constructed a MySQL-based online database and share it with the broader autism research community at http://autismkb.cbi.pku.edu.cn, supporting sophisticated browsing and searching functionalities.
Autism spectrum disorder (ASD) is a heterogeneous neurodevelopmental disorder characterized by impairments in reciprocal social interaction and communication and by the presence of restricted, repetitive and stereotyped patterns of behavior, interests and activities (1). ASD is an umbrella term for Autistic Disorder, Asperger Syndrome and Pervasive Developmental Disorder Not Otherwise Specified (PDD-NOS) (1). With an early onset prior to age 3 and a prevalence as high as 0.9–2.6% (2,3), ASD is one of the leading causes of childhood disability and places a serious burden on families and society (4).
Understanding the causes of ASD is critical for developing better treatment. Twin studies have shown that the heritability of ASD is as high as 38–90%, indicating strong contributions by genetic factors as well as environmental factors (5,6). The search for environmental factors has not yet led to convincing major candidates whereas the search for genes associated with autism, although far from complete or conclusive, has been more fruitful. The genes discovered so far can be roughly grouped into two categories: ‘syndromic autism related genes’ or causal genes underlying genetic disorders that cause autistic symptoms such as Fragile X Syndrome, Rett Syndrome, Tuberous Sclerosis Complex and dozens of other disorders (7,8), and ‘non-syndromic autism related genes’ most of which are susceptibility genes (9). Many experimental methods have been used to identify associated genes, including the earlier linkage analyses and low-scale candidate gene association or experimental studies as well as the more recent genome-wide association studies (GWAS), genome-wide CNV studies and expression profiling.
With hundreds of studies published, especially the recent genome-wide studies, and with next-generation sequencing technologies providing even more power for further gene discoveries (10), a new challenge has emerged: it has become increasingly difficult for an autism researcher to answer with confidence how many genes have been associated with ASD, how strong the evidence is, what features the genes have and what pathways they are involved in. The amount of available literature and data and the intrinsic complexity of autism genetics demand bioinformatic data management and analysis. Three efforts have been made so far by different groups to collect genes and variations associated with ASD: AutDB (also known as SFARI Gene) collected 219 genes (11,12), the Autism Genetic Database (AGD) collected 226 genes and 743 CNVs (13) and the Autism Chromosome Rearrangement Database (ACRD) collected 372 breakpoints and other genomic features (14). However, they are far from a comprehensive survey of autism genetics. To provide a clearer overall picture of autism genetics, we performed a comprehensive review and analysis of published literature and data, described below, resulting in a total of 2193 genes, 2806 SNPs/VNTRs, 4544 CNVs and 158 linkage regions. We provide the results as an online resource for the broader autism research community at http://autismkb.cbi.pku.edu.cn/ with extensive evidence and annotations, supporting sophisticated browsing and searching functionalities.
We searched the PubMed database for publications related to autism genetics, using the query term ‘autism AND associat*’ for association studies, ‘autism AND (gene OR microarray OR proteomics)’ for expression profiling studies and the other low-scale experimental studies, and ‘autism AND (CNV OR copy number variation OR microarray* OR microdel* OR microdup* OR rearrange* OR (genome-wide AND (linkage OR associa* OR scan)))’ for CNV and linkage studies. The abstracts of the 4000+ articles retrieved were reviewed to remove irrelevant papers, resulting in a final set of 579 articles, reporting a total of 11 GWAS, 242 low-scale candidate gene association studies, 13 expression profiling studies, 95 genome-wide CNV studies, 23 genome-wide linkage analyses and 236 other low-scale experimental studies.
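The three query terms above can be kept as data and turned into reproducible search requests. The sketch below is only an illustration: it assumes the public NCBI E-utilities ESearch endpoint, which the text does not mention, and the dictionary labels and the `esearch_url` helper are hypothetical names. It builds URLs without sending any request.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Query terms quoted from the text; the dictionary keys are our own labels.
QUERIES = {
    "association": "autism AND associat*",
    "expression_and_other": "autism AND (gene OR microarray OR proteomics)",
    "cnv_and_linkage": (
        "autism AND (CNV OR copy number variation OR microarray* OR microdel* "
        "OR microdup* OR rearrange* "
        "OR (genome-wide AND (linkage OR associa* OR scan)))"
    ),
}

# NCBI E-utilities search endpoint. Assumption: the paper does not say how
# the searches were submitted, so this is one possible way to repeat them.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term, retmax=100):
    """Build an ESearch URL for a PubMed query (no network access here)."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return ESEARCH + "?" + urlencode(params)


urls = {label: esearch_url(term) for label, term in QUERIES.items()}
```

Keeping the exact query strings next to the code that submits them makes it easier to rerun the literature search when the database is updated; the manual review of abstracts described above would still be needed afterwards.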
For syndromic autism-related genes, we first collected the autism-related disorders and their causal genes from a recently published comprehensive review (7). We then searched OMIM to get the official disease names and linked all the disorders to OMIM, and searched PubMed for additional citations using the query ‘(OMIM disease name) AND autism’ for each disease. All citations were double-checked manually. Finally, 99 genes for 94 autism-related disorders supported by 250 references were included in our data set of ‘Syndromic Autism Related Genes’.
In total, we collected 2135 non-syndromic autism-related genes, 99 syndromic autism-related genes, 4544 CNVs and 158 linkage regions. The genes located in the CNV and linkage regions were then retrieved using the UCSC Genome Browser (15).
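Retrieving the genes located in a CNV or linkage region amounts to an interval-overlap query. The following is a minimal sketch of that idea, not the UCSC Genome Browser workflow used in the paper; the `Interval` type, the helper names and all coordinates are made up for illustration.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def overlaps(a, b):
    """True if two intervals lie on the same chromosome and share a base."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def genes_in_region(region, genes):
    """Return the sorted names of genes overlapping the region."""
    return sorted(name for name, iv in genes.items() if overlaps(region, iv))


# Hypothetical gene coordinates and a hypothetical CNV, for illustration only.
toy_genes = {
    "GENE_A": Interval("chr1", 100, 500),
    "GENE_B": Interval("chr1", 900, 1500),
    "GENE_C": Interval("chr2", 100, 500),
}
toy_cnv = Interval("chr1", 400, 1000)

hits = genes_in_region(toy_cnv, toy_genes)
```

A linear scan like this is adequate for a few thousand regions; an interval tree or a sorted index would be a reasonable choice for larger annotation sets.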
To establish the strength of evidence, we collected metadata about each study and result. Supplementary Tables S1–S7 list the evidence collected for each type of experimental method. In summary, for each study of non-syndromic autism, we collected the clinical and demographic features of the samples including ancestral background, country of origin, inclusion and exclusion criteria, number of cases and controls with gender ratio, age at examination and diagnosis criteria. We also collected metadata about the experimental design including platform, experimental methods, statistical methods and statistical significance.
For each gene, we estimated how much evidence supports its role in autism for each type of experimental method and calculated a weighted sum, following a multi-dimensional evidence-based candidate gene prioritization approach (16). First, we assigned initial scores to the genes for each type of experimental method (Supplementary Table S8). A score of 0 is given if there is no positive evidence of that type. Table 1 lists the distribution of the scores for each type. Next, we used a benchmark data set consisting of 21 non-syndromic autism-related genes considered high confidence from six autism reviews (8,9,17–20) (Supplementary Table S9) to calculate the weights. We followed a gene prioritization approach (16) to generate a candidate weight matrix pool consisting of d^N = 7^6 weight vectors, where N = 6 is the number of experimental methods and d = N + 1 = 7 is the number of possible weights (1–7) in the weight vectors. A combined score for each gene was then calculated by summing the products of the scores and corresponding weights from the six experimental methods (16). All 2135 candidate genes, including the 21 benchmark genes, were sorted by their combined scores. We selected the weight matrix that gave the benchmark genes the highest rank as the optimal weight matrix (Supplementary Table S10). About 95% of the benchmark genes were ranked among the top 98% of all candidate genes. We chose the lowest combined score in the benchmark data set, 9, as the cutoff for high-confidence genes, resulting in a core data set of 383 non-syndromic autism-related genes. Because the definition of an ‘optimal weight matrix’ is always debatable, we provide an online ranking tool that allows users to re-rank the genes interactively by entering customized weights based on their own experience and preferences.
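The scoring scheme can be illustrated with a small sketch. It is not the authors' implementation: the toy genes and scores are invented, the example uses three methods instead of six to stay small, and the selection criterion (the smallest sum of benchmark rank positions) is our assumption about what "highest rank" means. All function names are hypothetical.

```python
from itertools import product


def combined_score(scores, weights):
    """Weighted sum of per-method evidence scores."""
    return sum(s * w for s, w in zip(scores, weights))


def rank_genes(gene_scores, weights):
    """Gene names sorted by combined score, highest first (ties keep input order)."""
    return sorted(
        gene_scores,
        key=lambda g: combined_score(gene_scores[g], weights),
        reverse=True,
    )


def best_weights(gene_scores, benchmark, n_methods):
    """Exhaustively search all d**N weight vectors with d = N + 1 weights (1..d).

    Assumption: the 'optimal' vector is the one minimizing the summed rank
    positions of the benchmark genes; the first such vector found is returned.
    """
    d = n_methods + 1
    best, best_cost = None, None
    for weights in product(range(1, d + 1), repeat=n_methods):
        position = {g: i for i, g in enumerate(rank_genes(gene_scores, weights))}
        cost = sum(position[g] for g in benchmark)
        if best_cost is None or cost < best_cost:
            best, best_cost = weights, cost
    return best


def core_set(gene_scores, benchmark, weights):
    """Genes scoring at least as high as the lowest-scoring benchmark gene."""
    cutoff = min(combined_score(gene_scores[g], weights) for g in benchmark)
    return {g for g, s in gene_scores.items() if combined_score(s, weights) >= cutoff}


# Invented scores for three methods; in the paper N = 6, giving 7**6 vectors.
toy_scores = {"G1": (3, 0, 1), "G2": (0, 2, 0), "G3": (1, 1, 1), "G4": (0, 0, 3)}
toy_benchmark = {"G1", "G4"}

w = best_weights(toy_scores, toy_benchmark, n_methods=3)
core = core_set(toy_scores, toy_benchmark, w)
```

With six methods the exhaustive search covers 7^6 = 117,649 weight vectors, which is still small enough to enumerate directly.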
For syndromic autism, we assigned four levels to the autism-related disorders: Level 1 disorders have one reported case with autistic symptoms, Level 2 have two to three cases in a single family, Level 3 have cases in more than one family and Level 4 are reported in multiple review papers (8). Causal genes of Level 3 and 4 disorders were considered high-confidence genes in the core data set.
To better understand the function of the genes associated with autism, we collected extensive functional information and data, including crosslinks to NCBI Entrez Gene (21), OMIM (21), UniProt (http://www.uniprot.org/) and Ensembl (http://www.ensembl.org/), functional groups based on Gene Ontology (http://www.geneontology.org/), protein–protein interactions from the databases BioGRID (22), BIND (23) and HPRD (24), and genomic variants from the Database of Genomic Variants (DGV) (25). We linked the genes to three psychiatric disease databases, AlzGene (26), SzGene (27) and PDGene (http://www.pdgene.org/), when a gene is shared between these diseases and ASD. Information about homologues of the genes was retrieved from Mouse Genome Informatics (MGI) (28), the Zebrafish Model Organism Database (ZFIN) (29) and FlyBase (30). We collected comprehensive mRNA expression profiling data, including ESTs from NCBI UniGene Profiles (21), microarray expression profiles from BioGPS (31) and the Allen Brain Atlas (32), and RNA-Seq data (33–38). Protein expression evidence at the peptide level was retrieved from PRIDE (39) and PeptideAtlas (40). We also collected transcription factor binding sites in the upstream regions of the genes from an in-house collection of ChIP-Chip and ChIP-Seq data, miRNAs that may target the genes from miRWalk (41) and TarBase (42), and natural antisense transcripts that may regulate the genes from NATsDB (43). Possible post-translational modifications were retrieved from UniProt and dbPTM (44). We used KOBAS 2.0 (45) to retrieve the pathways that the genes are involved in from BioCyc (46), KEGG Pathway (47), PID (48), PID Reactome (48), PANTHER (49) and Reactome (50), and possible associations with other diseases from disease databases including KEGG Disease (51), FunDO (52,53), GAD (54), NHGRI GWAS Catalog (55) and OMIM (21).
Pharmacogenetics and drug information were collected from the Comparative Toxicogenomics Database (CTD) (56), the Pharmacogenomics Knowledge Base (57) and DrugBank (58). Supplementary Table S11 summarizes the gene coverage from each source database. The overlap between the genes discovered by expression profiling and those discovered by the other genetic technologies is shown in Supplementary Table S12.
Enriched functional pathways were identified by KOBAS 2.0 (45) and enriched GO terms were identified by DAVID (59). Pathways such as neuroactive ligand–receptor interaction, synapse transmission and axon guidance were statistically significantly enriched in the core data set (Table 2). In addition to synapse transmission, GO terms such as transmission of nerve impulse and neuron differentiation were also found to be statistically significant (Table 3). The result is consistent with recent findings that synapse development, axon targeting and neuron motility are related to autism etiology (60,61).
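Pathway or GO term enrichment of this kind is commonly assessed with a one-sided hypergeometric test. The text does not state which statistic was applied, so the sketch below shows only the general technique. The function name is hypothetical, and apart from the core set size of 434 every number in the example (background size, pathway size, overlap) is invented.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k): chance of seeing at least k pathway genes in a gene list.

    N: genes in the background, K: background genes in the pathway,
    n: genes in the list of interest, k: list genes in the pathway.
    """
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Invented example: 30 of 434 core genes fall in a pathway of 300 genes,
# against an assumed background of 20000 genes.
p = hypergeom_pvalue(k=30, n=434, K=300, N=20000)
```

When many pathways or terms are tested at once, the resulting p-values would normally be corrected for multiple testing before being reported as significant.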
Users can browse the data in AutismKB in a variety of ways, including by data set, experimental method or chromosome. The gene lists include a summary of information about the genes, hyperlinked to detailed gene evidence and annotation pages. Figure 1 shows a typical AutismKB gene entry. Basic information such as gene symbol, gene name, cytoband and cross links is provided (Figure 1A). Nucleotide sequences and protein sequences can be sent to WebLab (62) for further analysis (Figure 1B). Summaries of supporting evidence and category-specific scores are provided (Figure 1C). Users can click on the hyperlink of a category-specific score to view the corresponding category of evidence. Categories without any evidence are hidden by default (Figure 1D). Users can click on ‘+’ to expand or ‘−’ to collapse different categories. Detailed information on polymorphisms from low-scale association studies and GWAS can be found by clicking on ‘detail’ in the tables (Figure 1E). When exploring other low-scale studies and large-scale expression studies, users can click the down arrow at the right of the table to obtain more information (Figure 1F). Annotations of each gene can be obtained by clicking the label ‘view annotation’ at the top left.
CNVs are presented in a tabular view with name, cytoband, gain or loss, number, evidence types and reference. Users can filter the table by evidence type and chromosome (Figure 2A). Clicking on the name brings up detailed information about each CNV, including the samples and methods of the study, the CNV region, and any syndromic and non-syndromic autism genes in the region (Figure 2B). Users can filter the linkage regions by chromosome and click on a linkage name to view detailed information.
AutismKB supports both text-based search and sequence-based search. A quick search box at the top right of each page allows users to search by gene symbol. An advanced search is provided to search genes, CNVs and linkage regions by gene name, gene symbol, NCBI Entrez ID, Ensembl ID, GO terms, UniProt ID, location, score, method and PubMed ID. Finally, a BLAST search against the nucleotide or protein sequences of all AutismKB genes is also available.
AutismKB is a comprehensive knowledgebase of autism-related genes, CNVs and linkage regions with extensive evidence and annotations. AutismKB will be updated periodically. We hope that it can be a valuable resource for the autism research community.
Supplementary Data are available at NAR Online: Supplementary Tables 1-12.
Funding for open access charge: National Outstanding Young Investigator Award from Natural Science Foundation of China (grant number: 31025014); 973 Basic Research Program (grant number: 2011CBA01102); scholarships from Merck and Johnson and Johnson.
Conflict of interest statement. None declared.
We thank Ge Gao, Chuan-Yun Li, Yong-Xin Ye and Ying-Fu Zhong for useful comments on the web interface.