1.  C9orf72 ablation causes immune dysregulation characterized by leukocyte expansion, autoantibody production, and glomerulonephropathy in mice 
Scientific Reports  2016;6:23204.
The expansion of a hexanucleotide (GGGGCC) repeat in C9ORF72 is the most common cause of amyotrophic lateral sclerosis (ALS) and frontotemporal dementia (FTD). Both the function of C9ORF72 and the mechanism by which the repeat expansion drives neuropathology are unknown. To examine whether C9ORF72 haploinsufficiency induces neurological disease, we created a C9orf72-deficient mouse line. Null mice developed a robust immune phenotype characterized by myeloid expansion, T cell activation, and increased plasma cells. Mice also presented with elevated autoantibodies and evidence of immune-mediated glomerulonephropathy. Collectively, our data suggest that C9orf72 regulates immune homeostasis and that an autoimmune response reminiscent of systemic lupus erythematosus (SLE) occurs in its absence. Our data further imply that haploinsufficiency is unlikely to be the causative factor in C9ALS/FTD pathology.
PMCID: PMC4793236  PMID: 26979938
2.  Enrichment of Genetic Variants for Rheumatoid Arthritis within T-Cell and NK-Cell Enhancer Regions 
Molecular Medicine  2015;21(1):180-184.
To identify disease-causative variants, we intersected the published results of a meta-analysis of genome-wide association studies (GWAS) for rheumatoid arthritis (RA) with the set of enhancer regions for 71 primary cell types that was provided by the FANTOM consortium. We first retrieved all single nucleotide polymorphisms (SNPs) that are associated (P < 5 × 10⁻⁸) with RA in the GWAS meta-analysis and that are located in any of these enhancer regions. After excluding the major histocompatibility complex (MHC) region, we identified 50 such RA-associated SNPs that are located in enhancer regions. Enhancer sets from different cell types were then compared with each other for their number of RA-associated SNPs by permutation analysis. This analysis showed that RA-associated SNPs are preferentially located in enhancers from several immunological cell types. In particular, we see a strong relative enrichment in enhancer regions that are active in T cells (P < 0.001) and NK cells (P < 0.001). Several loci display multiple RA-associated SNPs in tight linkage disequilibrium that are located within the same or neighboring enhancers. These haplotypes may have a greater likelihood to influence enhancer activity than any SNP on its own. Taken together, these results support the hypothesis that RA-causative variants often act through altering the activity of immune cell enhancers. The enrichment in T-cell and NK-cell enhancer regions indicates that expression changes in these cell types are particularly relevant for the pathogenesis of RA. The specific SNPs that account for this enrichment can be used as a basis for focused genotype-phenotype studies of these cell types.
PMCID: PMC4503658  PMID: 25794145
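The permutation analysis mentioned in entry 2 can be illustrated with a minimal sketch. The abstract does not describe the exact procedure, so everything here is an assumption: the function name `permutation_p_value`, the choice to shuffle SNP-hit labels across enhancers, and the toy counts are all hypothetical and are not taken from the study.

```python
import random


def permutation_p_value(in_set, has_hit, n_perm=10000, seed=0):
    """One-sided permutation p-value for an excess of hits in a set.

    in_set[i]  -- True if enhancer i is active in the cell type of interest
    has_hit[i] -- True if enhancer i contains an associated SNP
    (Hypothetical inputs; the study's real data layout is not given.)
    """
    if len(in_set) != len(has_hit):
        raise ValueError("in_set and has_hit must have the same length")
    rng = random.Random(seed)
    # Observed number of enhancers that are both in the set and hit.
    observed = sum(1 for s, h in zip(in_set, has_hit) if s and h)
    labels = list(has_hit)
    at_least = 0
    for _ in range(n_perm):
        # Shuffle hit labels across all enhancers, keeping set membership fixed.
        rng.shuffle(labels)
        permuted = sum(1 for s, h in zip(in_set, labels) if s and h)
        if permuted >= observed:
            at_least += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least + 1) / (n_perm + 1)


# Toy example (made-up numbers): 100 enhancers, 20 in the set, 10 hits, 8 in the set.
toy_in_set = [True] * 20 + [False] * 80
toy_has_hit = [True] * 8 + [False] * 12 + [True] * 2 + [False] * 78
print(permutation_p_value(toy_in_set, toy_has_hit))
```

Shuffling labels while keeping set membership fixed tests whether hits fall in the set more often than chance placement would predict; a real analysis would likely also need to account for enhancer length and linkage disequilibrium, which this sketch ignores.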
3.  Mappability and read length 
Frontiers in Genetics  2014;5:381.
Power-law distributions are the main functional form for the distribution of repeat size and repeat copy number in the human genome. When the genome is broken into fragments for sequencing, the limited size of fragments and reads may prevent a unique alignment of repeat sequences to the reference sequence. Repeats in the human genome can be as long as 10⁴ bases, or 10⁵ to 10⁶ bases when allowing for mismatches between repeat units. Sequence reads from these regions are therefore unmappable when the read length is in the range of 10³ bases. With a read length of 1000 bases, slightly more than 1% of the assembled genome, and slightly less than 1% of the 1 kb reads, are unmappable, excluding the unassembled portion of the human genome (8% in GRCh37/hg19). The slow decay (long tail) of the power-law function implies a diminishing return in converting unmappable regions/reads to become mappable with the increase of the read length, with the understanding that increasing read length will always move toward the direction of 100% mappability.
PMCID: PMC4226227  PMID: 25426137
next-generation sequencing; repeats; mappability; power-law distribution; copy number variations
4.  Disease variants in genomes of 44 centenarians 
To identify previously reported disease mutations that are compatible with extraordinary longevity, we screened the coding regions of the genomes of 44 Ashkenazi Jewish centenarians. Individual genome sequences were generated with 30× coverage on the Illumina HiSeq 2000 and single-nucleotide variants were called with the Genome Analysis Toolkit (GATK). We identified 130 coding variants that were annotated as “pathogenic” or “likely pathogenic” based on the ClinVar database and that are infrequent in the general population. These variants were previously reported to cause a wide range of degenerative, neoplastic, and cardiac diseases with autosomal dominant, autosomal recessive, and X-linked inheritance. Several of these variants are located in genes that harbor actionable incidental findings, according to the recommendations of the American College of Medical Genetics. In addition, we found risk variants for late-onset neurodegenerative diseases, such as the APOE ε4 allele that was even present in a homozygous state in one centenarian who did not develop Alzheimer's disease. Our data demonstrate that the incidental finding of certain reported disease variants in an individual genome may not preclude an extraordinarily long life. When the observed variants are encountered in the context of clinical sequencing, it is thus important to exercise caution in justifying clinical decisions.
PMCID: PMC4190879  PMID: 25333069
Aging; Ashkenazi; centenarian; disease gene; incidental finding; whole genome sequencing
5.  Absolute pitch exhibits phenotypic and genetic overlap with synesthesia 
Human Molecular Genetics  2013;22(10):2097-2104.
Absolute pitch (AP) and synesthesia are two uncommon cognitive traits that reflect increased neuronal connectivity and have been anecdotally reported to occur together in an individual. Here we systematically evaluate the occurrence of synesthesia in a population of 768 subjects with documented AP. Out of these 768 subjects, 151 (20.1%) reported synesthesia, most commonly with color. These self-reports of synesthesia were validated in a subset of 21 study subjects, using an established methodology. We further carried out combined linkage analysis of 53 multiplex families with AP and 36 multiplex families with synesthesia. We observed a peak NPL LOD = 4.68 on chromosome 6q, as well as evidence of linkage on chromosome 2, using a dominant model. These data establish the close phenotypic and genetic relationship between AP and synesthesia. The chromosome 6 linkage region contains 73 genes; several leading candidate genes involved in neurodevelopment were investigated by exon resequencing. However, further studies will be required to definitively establish the identity of the causative gene(s) in the region.
PMCID: PMC4707203  PMID: 23406871
6.  Diminishing return for increased Mappability with longer sequencing reads: implications of the k-mer distributions in the human genome 
BMC Bioinformatics  2014;15:2.
The amount of non-unique sequence (non-singletons) in a genome directly affects the difficulty of read alignment to a reference assembly for high-throughput sequencing data. Although a longer read is more likely to be uniquely mapped to the reference genome, a quantitative analysis of the influence of read lengths on mappability has been lacking. To address this question, we evaluate the k-mer distribution of the human reference genome. The k-mer frequency is determined for k ranging from 20 bp to 1000 bp.
We observe that the proportion of non-singleton k-mers decreases slowly with increasing k, and can be fitted by piecewise power-law functions with different exponents at different ranges of k. A slower decay at greater values for k indicates more limited gains in mappability for read lengths between 200 bp and 1000 bp. The frequency distributions of k-mers exhibit long tails with a power-law-like trend, and rank frequency plots exhibit a concave Zipf’s curve. The most frequent 1000-mers comprise 172 regions, which include four large stretches on chromosomes 1 and X, containing genes of biomedical relevance. Comparison with other databases indicates that the 172 regions can be broadly classified into two types: those containing LINE transposable elements and those containing segmental duplications.
Read mappability as measured by the proportion of singletons increases steadily up to the length scale around 200 bp. When read length increases above 200 bp, smaller gains in mappability are expected. Moreover, the proportion of non-singletons decreases with read length much more slowly than linearly. Even a read length of 1000 bp would not allow the unique alignment of reads for many coding regions of human genes. A mix of techniques will be needed for efficiently producing high-quality data that cover the complete human genome.
PMCID: PMC3927684  PMID: 24386976
Next-generation sequencing; Read alignment; Repeat sequences; Genome redundancy; Long-tail distribution; k-mers
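The non-singleton measure used in entry 6 can be sketched on a toy string. This is only an illustration under stated assumptions: it counts k-mers on the forward strand only, uses exact matching, and works on a short made-up sequence rather than the human reference genome. The function name `non_singleton_fraction` is hypothetical.

```python
from collections import Counter


def non_singleton_fraction(sequence: str, k: int) -> float:
    """Fraction of k-mer positions whose k-mer occurs more than once.

    Assumption: forward strand only, exact matches; a genome-scale
    analysis would need a far more memory-efficient approach.
    """
    n_positions = len(sequence) - k + 1
    if k <= 0 or n_positions <= 0:
        raise ValueError("k must be between 1 and the sequence length")
    counts = Counter(sequence[i:i + k] for i in range(n_positions))
    # Each repeated k-mer contributes all of its occurrences.
    repeated_positions = sum(c for c in counts.values() if c > 1)
    return repeated_positions / n_positions


# Made-up toy sequence; longer k leaves fewer repeated k-mers.
toy = "ACGTACGTTTGACGTACGAA"
for k in (2, 4, 8):
    print(k, round(non_singleton_fraction(toy, k), 3))
```

Evaluating this fraction over a range of k values and plotting it on log-log axes is one way to look for the slow, power-law-like decay the abstract describes.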
7.  CSK regulatory polymorphism is associated with systemic lupus erythematosus and influences B cell signaling and activation 
Nature Genetics  2012;44(11):1227-1230.
C-src tyrosine kinase, Csk, physically interacts with the intracellular phosphatase Lyp (PTPN22) and can modify the activation state of downstream Src kinases, such as Lyn, in lymphocytes. We identified an association of Csk with systemic lupus erythematosus (SLE) and refined its location to an intronic polymorphism rs34933034 (OR 1.32, p = 1.04 × 10⁻⁹). The risk allele is associated with increased CSK expression and augments inhibitory phosphorylation of Lyn. In carriers of the risk allele, B cell receptor (BCR)-mediated activation of mature B cells, as well as plasma IgM, are increased. Moreover, the fraction of transitional B cells is doubled in the cord blood of carriers of the risk allele compared to non-risk haplotypes due to an expansion of the late transitional cells, a stage targeted by selection mechanisms. This suggests that the Lyp-Csk complex increases susceptibility to lupus at multiple maturation and activation points of B cells.
PMCID: PMC3715052  PMID: 23042117
8.  Five amino acids in three HLA proteins explain most of the association between MHC and seropositive rheumatoid arthritis 
Nature Genetics  2012;44(3):291-296.
The genetic association of the major histocompatibility complex (MHC) to rheumatoid arthritis risk has commonly been attributed to HLA-DRB1 alleles. Yet controversy persists about the causal variants in HLA-DRB1 and the presence of independent effects elsewhere in the MHC. Using existing genome-wide SNP data in 5,018 seropositive cases and 14,974 controls, we imputed and tested classical alleles and amino acid polymorphisms for HLA-A, B, C, DPA1, DPB1, DQA1, DQB1, and DRB1 along with 3,117 SNPs across the MHC. Conditional and haplotype analyses reveal that three amino acid positions (11, 71 and 74) in HLA-DRβ1, and single amino acid polymorphisms in HLA-B (position 9) and HLA-DPβ1 (position 9), all located in the peptide-binding grooves, almost completely explain the MHC association to disease risk. This study illustrates how imputation of functional variation from large reference panels can help fine-map association signals in the MHC.
PMCID: PMC3288335  PMID: 22286218
9.  A Simple Method for Analyzing Exome Sequencing Data Shows Distinct Levels of Nonsynonymous Variation for Human Immune and Nervous System Genes 
PLoS ONE  2012;7(6):e38087.
To measure the strength of natural selection that acts upon single nucleotide variants (SNVs) in a set of human genes, we calculate the ratio between nonsynonymous SNVs (nsSNVs) per nonsynonymous site and synonymous SNVs (sSNVs) per synonymous site. We transform this ratio with a respective factor f that corrects for the bias of synonymous sites towards transitions in the genetic code and different mutation rates for transitions and transversions. This method approximates the relative density of nsSNVs (rdnsv) in comparison with the neutral expectation as inferred from the density of sSNVs. Using SNVs from a diploid genome and 200 exomes, we apply our method to immune system genes (ISGs), nervous system genes (NSGs), randomly sampled genes (RSGs), and gene ontology annotated genes. The estimate of rdnsv in an individual exome is around 20% for NSGs and 30–40% for ISGs and RSGs. This smaller rdnsv of NSGs indicates overall stronger purifying selection. To quantify the relative shift of nsSNVs towards rare variants, we next fit a linear regression model to the estimates of rdnsv over different SNV allele frequency bins. The obtained regression models show a negative slope for NSGs, ISGs and RSGs, supporting an influence of purifying selection on the frequency spectrum of segregating nsSNVs. The y-intercept of the model predicts rdnsv for an allele frequency close to 0. This parameter can be interpreted as the proportion of nonsynonymous sites where mutations are tolerated to segregate with an allele frequency notably greater than 0 in the population, given the performed normalization of the observed nsSNV to sSNV ratio. A smaller y-intercept is displayed by NSGs, indicating more nonsynonymous sites under strong negative selection. This predicts more monogenically inherited or de-novo mutation diseases that affect the nervous system.
PMCID: PMC3368947  PMID: 22701602
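The two calculations described in entry 9 (a normalized nonsynonymous-to-synonymous density ratio, then a linear fit over allele frequency bins) can be sketched as follows. The abstract does not say how the factor f enters the ratio, so applying it as a multiplier is an assumption here; the function names and all numbers are hypothetical and do not come from the study.

```python
def rdnsv(n_ns_snvs, n_ns_sites, n_s_snvs, n_s_sites, f=1.0):
    """Relative density of nonsynonymous SNVs versus synonymous SNVs.

    Assumption: the correction factor f is applied multiplicatively.
    """
    ns_density = n_ns_snvs / n_ns_sites  # nsSNVs per nonsynonymous site
    s_density = n_s_snvs / n_s_sites     # sSNVs per synonymous site
    return f * ns_density / s_density


def fit_line(xs, ys):
    """Ordinary least-squares line; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up counts for one gene set.
print(rdnsv(n_ns_snvs=30, n_ns_sites=3000, n_s_snvs=50, n_s_sites=1000))

# Made-up rdnsv estimates per allele-frequency bin (bin midpoints on the x axis).
bin_midpoints = [0.05, 0.15, 0.25, 0.35]
rdnsv_by_bin = [0.38, 0.33, 0.30, 0.25]
slope, intercept = fit_line(bin_midpoints, rdnsv_by_bin)
print(slope, intercept)
```

In this reading, a negative slope corresponds to nonsynonymous variants being shifted toward rare alleles, and the intercept is the extrapolated rdnsv at an allele frequency near 0, as the abstract describes.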
10.  Locus category based analysis of a large genome-wide association study of rheumatoid arthritis 
Human Molecular Genetics  2010;19(19):3863-3872.
To pinpoint true positive single-nucleotide polymorphism (SNP) associations in a genome-wide association study (GWAS) of rheumatoid arthritis (RA), we categorize genetic loci by external knowledge. We test both the ‘enrichment of associated loci’ in a locus category and the ‘combined association’ of a locus category. The former is quantified by the odds ratio for the presence of SNP associations at the loci of a category, whereas the latter is quantified by the number of loci in a category that have SNP associations. These measures are compared with their expected values as obtained from the permutation of the affection status. To account for linkage disequilibrium (LD) among SNPs, we view each LD block as a genetic locus. Positional candidates were defined as loci implicated by earlier GWAS results, whereas functional candidates were defined by annotations regarding the molecular roles of genes, such as gene ontology categories. As expected, immune-related categories show the largest enrichment signal, although it is not very strong. The intersection of positional and functional candidate information predicts novel RA loci near the genes TEC/TXK, MBL2 and PIK3R1/CD180. Notably, a combined association signal is not only produced by immune-related categories, but also by most other categories and even randomly defined categories. The unspecific quality of these signals limits the possible conclusions from combined association tests. It also reduces the magnitude of enrichment test results. These unspecific signals might result from common variants of small effect that are hardly concentrated in candidate categories, or from an inflated size of associated regions caused by weak LD with infrequent mutations.
PMCID: PMC2935861  PMID: 20639398
11.  Refining the association of MHC with multiple sclerosis in African Americans 
Human Molecular Genetics  2010;19(15):3080-3088.
Multiple sclerosis (MS) is a common demyelinating disease of the central nervous system mediated by autoimmune and neurodegenerative pathogenic mechanisms. Multiple genes account for its moderate heritability, but the only genetic region shown to have a large replicable effect on MS susceptibility is the major histocompatibility complex (MHC). Strong linkage disequilibrium (LD) across the MHC has made it difficult to fully characterize individual genetic contributions of this region to MS risk in previous studies. African Americans are at a lower risk for MS when compared with northern Europeans and Americans of European descent, but greater haplotypic diversity and distinct patterns of LD suggest that this population may be particularly informative for fine-mapping efforts. To examine the role of the MHC in African American MS, a case–control association study was performed with 499 African American MS patients and 750 African American controls that were genotyped for 6040 MHC region single nucleotide polymorphisms (SNPs). A replication data set consisting of 451 African American patients and 718 African American controls was genotyped for selected SNPs. Two MHC class II SNPs, rs2647040 and rs3135021, were significant in the replication cohort and partially tagged DRB1*15 alleles. Surprisingly, in comparison to similar studies of individuals of European descent, the MHC seems to play a smaller role in MS susceptibility in African Americans, consistent with pervasive genetic heterogeneity across ancestral groups, and may explain the difference in MS susceptibility between African Americans and individuals of European descent.
PMCID: PMC2901136  PMID: 20466734
12.  Functionally defective germline variants of sialic acid acetylesterase in autoimmunity 
Nature  2010;466(7303):243-247.
Sialic acid acetylesterase (SIAE) is an enzyme that negatively regulates B lymphocyte antigen receptor signaling and is required for the maintenance of immunological tolerance in mice [1, 2]. Heterozygous loss-of-function germline rare variants and a homozygous defective polymorphic variant of SIAE were identified in 24/923 Caucasian subjects with relatively common autoimmune disorders and in 2/648 Caucasian controls. All heterozygous loss-of-function SIAE mutations tested were capable of functioning in a dominant negative manner. A homozygous secretion-defective polymorphic variant of SIAE was catalytically active, lacked the ability to function in a dominant negative manner, and was seen in 8 autoimmune subjects but in no control subjects. The odds ratio for inheriting defective SIAE alleles was 8.6 in all autoimmune subjects, 8.3 in subjects with rheumatoid arthritis, and 7.9 in subjects with type I diabetes. Functionally defective SIAE rare and polymorphic variants represent a strong genetic link to susceptibility in relatively common human autoimmune disorders.
PMCID: PMC2900412  PMID: 20555325
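The odds ratio of 8.6 reported in entry 12 for all autoimmune subjects can be reproduced from the counts given in the abstract, assuming a plain unadjusted 2×2 calculation on the 24/923 cases and 2/648 controls. Whether the authors computed it exactly this way is an assumption; the function name `odds_ratio` is hypothetical.

```python
def odds_ratio(case_carriers, case_total, control_carriers, control_total):
    """Unadjusted odds ratio from carrier counts in cases and controls."""
    case_non = case_total - case_carriers
    control_non = control_total - control_carriers
    return (case_carriers * control_non) / (case_non * control_carriers)


# Counts from the abstract: 24 of 923 cases and 2 of 648 controls carry defective alleles.
print(round(odds_ratio(24, 923, 2, 648), 1))  # prints 8.6
```

With only 2 carriers among controls the estimate is sensitive to small count changes, so a confidence interval would be wide; the sketch does not compute one.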
13.  Recent positive selection of a human androgen receptor/ectodysplasin A2 receptor haplotype and its relationship to male pattern baldness 
Human Genetics  2009;126(2):255-264.
Genetic variants in the human androgen receptor gene (AR) are associated with male pattern baldness (androgenetic alopecia, AGA) in Europeans. Previous observations of long-range linkage disequilibrium at the AR locus are consistent with the hypothesis of recent positive selection. Here, we further investigate this signature and its relationship to the AGA risk haplotype. The haplotype homozygosity suggests that the AGA risk haplotype was driven to high frequency by positive selection in Europeans although a low meiotic recombination rate contributed to the high haplotype homozygosity. Further, we find high levels of population differentiation as measured by FST and a series of fixed derived alleles along an extended region centromeric to AR in the Asian HapMap sample. The predominant AGA risk haplotype also carries the putatively functional variant 57K in the flanking ectodysplasin A2 receptor gene (EDA2R). It is therefore probable that the AGA risk haplotype rose to high frequency in combination with this EDA2R variant, possibly by hitchhiking on a positively selected 57K haplotype.
Electronic supplementary material
The online version of this article (doi:10.1007/s00439-009-0668-z) contains supplementary material, which is available to authorized users.
PMCID: PMC3774421  PMID: 19373488
14.  Partial correlation analysis indicates causal relationships between GC-content, exon density and recombination rate in the human genome 
BMC Bioinformatics  2009;10(Suppl 1):S66.
Several features are known to correlate with the GC-content in the human genome, including recombination rate, gene density and distance to telomere. However, by testing for pairwise correlation only, it is impossible to distinguish direct associations from indirect ones and to distinguish between causes and effects.
We use partial correlations to construct partially directed graphs for the following four variables: GC-content, recombination rate, exon density and distance-to-telomere. Recombination rate and exon density are unconditionally uncorrelated, but become inversely correlated by conditioning on GC-content. This pattern indicates a model where recombination rate and exon density are two independent causes of GC-content variation.
Causal inference and graphical models are useful methods to understand genome evolution and the mechanisms of isochore evolution in the human genome.
PMCID: PMC2648766  PMID: 19208170
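The pattern described in entry 14, where two variables are uncorrelated overall but become inversely correlated after conditioning on a third, follows from the standard first-order partial correlation formula. The sketch below uses made-up correlation values, not the study's estimates, and the function name `partial_correlation` is hypothetical.

```python
import math


def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y given z."""
    denominator = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    return (r_xy - r_xz * r_yz) / denominator


# Made-up values: x and y are uncorrelated, and both correlate positively with z.
print(round(partial_correlation(r_xy=0.0, r_xz=0.5, r_yz=0.5), 3))  # prints -0.333
```

Here x and y could stand for recombination rate and exon density and z for GC-content: a zero marginal correlation combined with a negative partial correlation is the signature of two independent causes acting on a common effect.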
15.  Transancestral mapping of the MHC region in systemic lupus erythematosus identifies new independent and interacting loci at MSH5, HLA-DPB1 and HLA-G 
Annals of the Rheumatic Diseases  2012;71(5):777-784.
Systemic lupus erythematosus (SLE) is a chronic multisystem genetically complex autoimmune disease characterised by the production of autoantibodies to nuclear and cellular antigens, tissue inflammation and organ damage. Genome-wide association studies have shown that variants within the major histocompatibility complex (MHC) region on chromosome 6 confer the greatest genetic risk for SLE in European and Chinese populations. However, the causal variants remain elusive due to tight linkage disequilibrium across disease-associated MHC haplotypes, the highly polymorphic nature of many MHC genes and the heterogeneity of the SLE phenotype.
A high-density case-control single nucleotide polymorphism (SNP) study of the MHC region was undertaken in SLE cohorts of Spanish and Filipino ancestry using a custom Illumina chip in order to fine-map association signals in these haplotypically diverse populations. In addition, comparative analyses were performed between these two datasets and a northern European UK SLE cohort. A total of 1433 cases and 1458 matched controls were examined.
Using this transancestral SNP mapping approach, novel independent loci were identified within the MHC region in UK, Spanish and Filipino patients with SLE with some evidence of interaction. These loci include HLA-DPB1, HLA-G and MSH5 which are independent of each other and HLA-DRB1 alleles. Furthermore, the established SLE-associated HLA-DRB1*15 signal was refined to an interval encompassing HLA-DRB1 and HLA-DQA1. Increased frequencies of MHC region risk alleles and haplotypes were found in the Filipino population compared with Europeans, suggesting that the greater disease burden in non-European SLE may be due in part to this phenomenon.
These data highlight the usefulness of mapping disease susceptibility loci using a transancestral approach, particularly in a region as complex as the MHC, and offer a springboard for further fine-mapping, resequencing and transcriptomic analysis.
PMCID: PMC3329227  PMID: 22233601

