Related Articles

1.  HapTree: A Novel Bayesian Framework for Single Individual Polyplotyping Using NGS Data 
PLoS Computational Biology  2014;10(3):e1003502.
As the more recent next-generation sequencing (NGS) technologies provide longer read sequences, the use of sequencing datasets for complete haplotype phasing is fast becoming a reality, allowing haplotype reconstruction of a single sequenced genome. Nearly all previous haplotype reconstruction studies have focused on diploid genomes and are rarely scalable to genomes with higher ploidy. Yet computational investigations into polyploid genomes carry great importance, impacting plant, yeast and fish genomics, as well as the studies of the evolution of modern-day eukaryotes and (epi)genetic interactions between copies of genes. In this paper, we describe a novel maximum-likelihood estimation framework, HapTree, for polyploid haplotype assembly of an individual genome using NGS read datasets. We evaluate the performance of HapTree on simulated polyploid sequencing read data modeled after Illumina sequencing technologies. For triploid and higher ploidy genomes, we demonstrate that HapTree substantially improves haplotype assembly accuracy and efficiency over the state-of-the-art; moreover, HapTree is the first scalable polyplotyping method for higher ploidy. As a proof of concept, we also test our method on real sequencing data from NA12878 (1000 Genomes Project) and evaluate the quality of assembled haplotypes with respect to trio-based diplotype annotation as the ground truth. The results indicate that HapTree significantly improves the switch accuracy within phased haplotype blocks as compared to existing haplotype assembly methods, while producing comparable minimum error correction (MEC) values. A summary of this paper appears in the proceedings of the RECOMB 2014 conference, April 2–5.
Author Summary
While human and other eukaryotic genomes typically contain two copies of every chromosome, plants, yeast and fish such as salmon can have strictly more than two copies of each chromosome. By running standard genotype calling tools, it is possible to accurately identify the number of “wild type” and “mutant” alleles (A, C, G, or T) for each single-nucleotide polymorphism (SNP) site. However, in the case of two heterozygous SNP sites, genotype calling tools cannot determine whether “mutant” alleles from different SNP loci are on the same or different chromosomes. While the former would be healthy, in many cases the latter can cause loss of function; it is therefore necessary to identify the phase—the copies of a chromosome on which the mutant alleles occur—in addition to the genotype. This necessitates efficient algorithms to obtain accurate and comprehensive phase information directly from the next-generation-sequencing read data in higher ploidy species. We introduce an efficient statistical method for this task and show that our method significantly outperforms previous ones, in both accuracy and speed, for phasing triploid and higher ploidy genomes. Our method performs well on human diploid genomes as well, as demonstrated by our improved phasing of the well known NA12878 (1000 Genomes Project).
PMCID: PMC3967924  PMID: 24675685
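The minimum error correction (MEC) value mentioned in the abstract above can be illustrated with a small sketch. This is not HapTree's code. It only assumes the common definition of MEC: each read is assigned to the candidate haplotype it disagrees with least, and the disagreements are summed over all reads. The function name `mec_score` and the read representation (a dict from SNP index to observed allele) are hypothetical choices made for this illustration.

```python
def mec_score(reads, haplotypes):
    """Sum, over all reads, of the fewest mismatches against any haplotype.

    reads      -- list of dicts mapping SNP index -> observed allele (0 or 1)
    haplotypes -- list of k equal-length allele lists (k = ploidy)
    """
    total = 0
    for read in reads:
        # Count mismatches between this read and each candidate haplotype.
        mismatches = [
            sum(1 for pos, allele in read.items() if hap[pos] != allele)
            for hap in haplotypes
        ]
        # The read is charged only for its best-matching haplotype.
        total += min(mismatches)
    return total


# Three reads covering three SNP sites (illustrative data only).
reads = [{0: 0, 1: 0}, {1: 1, 2: 1}, {0: 0, 2: 1}]
diploid = [[0, 0, 1], [1, 1, 0]]
triploid = [[0, 0, 1], [1, 1, 0], [0, 1, 1]]
diploid_mec = mec_score(reads, diploid)
triploid_mec = mec_score(reads, triploid)
```

Because the score takes a list of haplotypes, the same function covers diploid and higher-ploidy cases. A lower value means fewer read alleles have to be "corrected" to make the reads consistent with the proposed phasing.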
2.  The DNA Methylome of Human Peripheral Blood Mononuclear Cells 
PLoS Biology  2010;8(11):e1000533.
Analysis across the genome of patterns of DNA methylation reveals a rich landscape of allele-specific epigenetic modification and consequent effects on allele-specific gene expression.
DNA methylation plays an important role in biological processes in human health and disease. Recent technological advances allow unbiased whole-genome DNA methylation (methylome) analysis to be carried out on human cells. Using whole-genome bisulfite sequencing at 24.7-fold coverage (12.3-fold per strand), we report a comprehensive (92.62%) methylome and analysis of the unique sequences in human peripheral blood mononuclear cells (PBMC) from the same Asian individual whose genome was deciphered in the YH project. PBMC constitute an important source for clinical blood tests world-wide. We found that 68.4% of CpG sites and <0.2% of non-CpG sites were methylated, demonstrating that non-CpG cytosine methylation is minor in human PBMC. Analysis of the PBMC methylome revealed a rich epigenomic landscape for 20 distinct genomic features, including regulatory, protein-coding, non-coding, RNA-coding, and repeat sequences. Integration of our methylome data with the YH genome sequence enabled a first comprehensive assessment of allele-specific methylation (ASM) between the two haploid methylomes of any individual and allowed the identification of 599 haploid differentially methylated regions (hDMRs) covering 287 genes. Of these, 76 genes had hDMRs within 2 kb of their transcriptional start sites of which >80% displayed allele-specific expression (ASE). These data demonstrate that ASM is a recurrent phenomenon and is highly correlated with ASE in human PBMCs. Together with recently reported similar studies, our study provides a comprehensive resource for future epigenomic research and confirms new sequencing technology as a paradigm for large-scale epigenomics studies.
Author Summary
Epigenetic modifications such as addition of methyl groups to cytosine in DNA play a role in regulating gene expression. To better understand these processes, knowledge of the methylation status of all cytosine bases in the genome (the methylome) is required. DNA methylation can differ between the two gene copies (alleles) in each cell. Such allele-specific methylation (ASM) can be due to parental origin of the alleles (imprinting), X chromosome inactivation in females, and other as yet unknown mechanisms. This may significantly alter the expression profile arising from different allele combinations in different individuals. Using advanced sequencing technology, we have determined the methylome of human peripheral blood mononuclear cells (PBMC). Importantly, the PBMC were obtained from the same male Han Chinese individual whose complete genome had previously been determined. This allowed us, for the first time, to study genome-wide differences in ASM. Our analysis shows that ASM in PBMC is higher than can be accounted for by regions known to undergo parent-of-origin imprinting and frequently (>80%) correlates with allele-specific expression (ASE) of the corresponding gene. In addition, our data reveal a rich landscape of epigenomic variation for 20 genomic features, including regulatory, coding, and non-coding sequences, and provide a valuable resource for future studies. Our work further establishes whole-genome sequencing as an efficient method for methylome analysis.
PMCID: PMC2976721  PMID: 21085693
3.  Allele-Specific KRT1 Expression Is a Complex Trait 
PLoS Genetics  2006;2(6):e93.
The differential expression of alleles occurs commonly in humans and is likely an important genetic factor underlying heritable differences in phenotypic traits. Understanding the molecular basis of allelic expression differences is thus an important challenge. Although many genes have been shown to display differential allelic expression, this is the first study to examine in detail the cumulative effects of multiple cis-regulatory polymorphisms responsible for allele-specific expression differences. We have used a variety of experimental approaches to identify and characterize cis-regulatory polymorphisms responsible for the extreme allele-specific expression differences of keratin-1 (KRT1) in human white blood cells. The combined data from our analyses provide strong evidence that the KRT1 allelic expression differences result from the haplotypic combinations and interactions of five cis-regulatory single nucleotide polymorphisms (SNPs) whose alleles differ in their affinity to bind transcription factors and modulate KRT1 promoter activity. Two of these cis-regulatory SNPs bind transcriptional activators with the alleles on the high-expressing KRT1 haplotype pattern having a higher affinity than the alleles on the low-expressing haplotype pattern. In contrast, the other three cis-regulatory SNPs bind transcriptional inhibitors with the alleles on the low-expressing haplotype pattern having a higher affinity than the alleles on the high-expressing haplotype pattern. Our study provides important new insights into the degree of complexity that the cis-regulatory sequences responsible for allele-specific transcriptional regulation have. These data suggest that allelic expression differences result from the cumulative contribution of multiple DNA sequence polymorphisms, with each having a small effect, and that allele-specific expression can thus be viewed as a complex trait.
Despite the fact that all humans share nearly identical DNA sequences, individuals exhibit tremendous variation in heritable traits, such as height, weight, and skin texture. Recent evidence suggests that expression level differences between different copies (alleles) of a gene contribute to these observed differences in heritable traits. Currently, the mechanisms underlying allele-expression level differences are poorly understood. In this report the authors identified and characterized a set of five single nucleotide polymorphisms (SNPs) contributing to extreme expression differences between keratin-1 (KRT1) alleles in humans. Each of the five SNPs is found in a different regulatory sequence in the proximity of KRT1. The SNPs cause different copies of the five regulatory sequences to differ in their affinities to bind transcription factors controlling KRT1 expression. The extreme KRT1 allele-expression level differences result from the cumulative contributions of these five SNPs which are tightly linked and inherited in two common fixed sets, a low- and a high-expressing set. The study provides important new insights into the complexities of the mechanisms underlying allele-expression level differences. These complexities may explain the difficulties researchers frequently encounter when trying to discover the “causative SNP” in an interval identified as associated with an inherited trait in a genetic study.
PMCID: PMC1475705  PMID: 16789827
4.  Rapid gene-based SNP and haplotype marker development in non-model eukaryotes using 3'UTR sequencing 
BMC Genomics  2012;13:18.
Sweet cherry (Prunus avium L.), a non-model crop with narrow genetic diversity, is an important member of sub-family Amygdoloideae within Rosaceae. Compared to other important members like peach and apple, sweet cherry lacks genetic and genomic information, impeding understanding of important biological processes and development of efficient breeding approaches. Availability of single nucleotide polymorphism (SNP)-based molecular markers can greatly benefit breeding efforts in such non-model species. RNA-seq approaches employing second generation sequencing platforms offer a unique avenue to rapidly identify gene-based SNPs. Additionally, haplotype markers can be rapidly generated from transcript-based SNPs since they have been found to be extremely useful in identification of genetic variants related to health, disease and response to environment as highlighted by the human HapMap project.
RNA-seq was performed on two sweet cherry cultivars, Bing and Rainier using a 3' untranslated region (UTR) sequencing method yielding 43,396 assembled contigs. In order to test our approach of rapid identification of SNPs without any reference genome information, over 25% (10,100) of the contigs were screened for the SNPs. A total of 207 contigs from this set were identified to contain high quality SNPs. A set of 223 primer pairs were designed to amplify SNP containing regions from these contigs and high resolution melting (HRM) analysis was performed with eight important parental sweet cherry cultivars. Six of the parent cultivars were distantly related to Bing and Rainier, the cultivars used for initial SNP discovery. Further, HRM analysis was also performed on 13 seedlings derived from a cross between two of the parents. Our analysis resulted in the identification of 84 (38.7%) primer sets that demonstrated variation among the tested germplasm. Reassembly of the raw 3'UTR sequences using upgraded transcriptome assembly software yielded 34,620 contigs containing 2243 putative SNPs in 887 contigs after stringent filtering. Contigs with multiple SNPs were visually parsed to identify 685 putative haplotypes at 335 loci in 301 contigs.
This approach, which leverages the advantages of RNA-seq approaches, enabled rapid generation of gene-linked SNP and haplotype markers. The general approach presented in this study can be easily applied to other non-model eukaryotes irrespective of the ploidy level to identify gene-linked polymorphisms that are expected to facilitate efficient Gene Assisted Breeding (GAB), genotyping and population genetics studies. The identified SNP haplotypes reveal some of the allelic differences in the two sweet cherry cultivars analyzed. The identification of these SNP and haplotype markers is expected to significantly improve the genomic resources for sweet cherry and facilitate efficient GAB in this non-model crop.
PMCID: PMC3293726  PMID: 22239826
5.  AlleleSeq: analysis of allele-specific expression and binding in a network framework 
A computational pipeline for constructing a personal diploid genome and determining sites of allele-specific activity is developed. Using a regulatory network framework, allele-specific binding and expression are found to be significantly coordinated across the genome.
Software was developed for building a personal diploid genome sequence, and determining sites of allele-specific binding and expression (AlleleSeq). This computational pipeline was used to analyze variation data, and deeply sequenced RNA-Seq and ChIP-Seq datasets, for individual NA12878 from the 1000 Genomes Project. The interaction between allele-specific binding and allele-specific expression is investigated, revealing clear coordination.
To study allele-specific expression (ASE) and binding (ASB), that is, differences between the maternally and paternally derived alleles, we have developed a computational pipeline (AlleleSeq). Our pipeline initially constructs a diploid personal genome sequence (and corresponding personalized gene annotation) using genomic sequence variants (SNPs, indels, and structural variants), and then identifies allele-specific events with significant differences in the number of mapped reads between maternal and paternal alleles. There are many technical challenges in the construction and alignment of reads to a personal diploid genome sequence that we address, for example, bias of reads mapping to the reference allele. We have applied AlleleSeq to variation data for NA12878 from the 1000 Genomes Project as well as matched, deeply sequenced RNA-Seq and ChIP-Seq data sets generated for this purpose. In addition to observing fairly widespread allele-specific behavior within individual functional genomic data sets (including results consistent with X-chromosome inactivation), we can study the interaction between ASE and ASB. Furthermore, we investigate the coordination between ASE and ASB from multiple transcription factors events using a regulatory network framework. Correlation analyses and network motifs show mostly coordinated ASB and ASE.
PMCID: PMC3208341  PMID: 21811232
allele-specific; ChIP-Seq; networks; RNA-Seq
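The abstract above describes identifying allele-specific events from differences in the number of reads mapped to the maternal and paternal alleles, but it does not say which statistical test is used. As a hedged illustration only, the sketch below assumes a two-sided exact binomial test against an expected 50:50 ratio. The function name `allelic_imbalance_p` is hypothetical and is not part of AlleleSeq.

```python
from math import comb


def allelic_imbalance_p(maternal_reads, paternal_reads):
    """Two-sided exact binomial p-value under a 50:50 null (an assumed test)."""
    n = maternal_reads + paternal_reads
    if n == 0:
        return 1.0  # no reads, no evidence of imbalance
    k = min(maternal_reads, paternal_reads)
    # With p = 0.5 the binomial distribution is symmetric, so the two-sided
    # p-value is twice the lower tail P(X <= k), capped at 1.
    lower_tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * lower_tail)


# Illustrative counts at one heterozygous site.
balanced = allelic_imbalance_p(10, 10)
skewed = allelic_imbalance_p(0, 10)
```

In a genome-wide setting the resulting p-values would still need multiple-testing correction, and reads would need to be aligned in a way that avoids the reference-allele mapping bias that the abstract mentions.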
6.  Reliable allele detection using SNP-based PCR primers containing Locked Nucleic Acid: application in genetic mapping 
Plant Methods  2007;3:2.
The diploid, Solanum caripense, a wild relative of potato and tomato, possesses valuable resistance to potato late blight and we are interested in the genetic base of this resistance. Due to extremely low levels of genetic variation within the S. caripense genome it proved impossible to generate a dense genetic map and to assign individual Solanum chromosomes through the use of conventional chromosome-specific SSR, RFLP, AFLP, as well as gene- or locus-specific markers. The ease of detection of DNA polymorphisms depends on both frequency and form of sequence variation. The narrow genetic background of close relatives and inbreds complicates the detection of persisting, reduced polymorphism and is a challenge to the development of reliable molecular markers. Nonetheless, monomorphic DNA fragments representing not directly usable conventional markers can contain considerable variation at the level of single nucleotide polymorphisms (SNPs). This can be used for the design of allele-specific molecular markers. The reproducible detection of allele-specific markers based on SNPs has been a technical challenge.
We present a fast and cost-effective protocol for the detection of allele-specific SNPs by applying Sequence Polymorphism-Derived (SPD) markers. These markers proved highly efficient for fingerprinting of individuals possessing a homogeneous genetic background. SPD markers are obtained from within non-informative, conventional molecular marker fragments that are screened for SNPs to design allele-specific PCR primers. The method makes use of primers containing a single, 3'-terminal Locked Nucleic Acid (LNA) base. We demonstrate the applicability of the technique by successful genetic mapping of allele-specific SNP markers derived from monomorphic Conserved Ortholog Set II (COSII) markers mapped to Solanum chromosomes, in S. caripense. By using SPD markers it was possible for the first time to map the S. caripense alleles of 16 chromosome-specific COSII markers and to assign eight of the twelve linkage groups to consensus Solanum chromosomes.
The method based on individual allelic variants allows for an order-of-magnitude higher resolution of genetic variation than conventional marker techniques. We show that the majority of monomorphic molecular marker fragments from organisms with reduced heterozygosity levels still contain SNPs that are sufficient to trace individual alleles.
PMCID: PMC1802836  PMID: 17286854
7.  Polymorphisms, Mutations, and Amplification of the EGFR Gene in Non-Small Cell Lung Cancers 
PLoS Medicine  2007;4(4):e125.
The epidermal growth factor receptor (EGFR) gene is the prototype member of the type I receptor tyrosine kinase (TK) family and plays a pivotal role in cell proliferation and differentiation. There are three well described polymorphisms that are associated with increased protein production in experimental systems: a polymorphic dinucleotide repeat (CA simple sequence repeat 1 [CA-SSR1]) in intron one (lower number of repeats) and two single nucleotide polymorphisms (SNPs) in the promoter region, −216 (G/T or T/T) and −191 (C/A or A/A). The objective of this study was to examine distributions of these three polymorphisms and their relationships to each other and to EGFR gene mutations and allelic imbalance (AI) in non-small cell lung cancers.
Methods and Findings
We examined the frequencies of the three polymorphisms of EGFR in 556 resected lung cancers and corresponding non-malignant lung tissues from 336 East Asians, 213 individuals of Northern European descent, and seven of other ethnicities. We also studied the EGFR gene in 93 corresponding non-malignant lung tissue samples from European-descent patients from Italy and in peripheral blood mononuclear cells from 250 normal healthy US individuals enrolled in epidemiological studies including individuals of European descent, African–Americans, and Mexican–Americans. We sequenced the four exons (18–21) of the TK domain known to harbor activating mutations in tumors and examined the status of the CA-SSR1 alleles (presence of heterozygosity, repeat number of the alleles, and relative amplification of one allele) and allele-specific amplification of mutant tumors as determined by a standardized semiautomated method of microsatellite analysis. Variant forms of SNP −216 (G/T or T/T) and SNP −191 (C/A or A/A) (associated with higher protein production in experimental systems) were less frequent in East Asians than in individuals of other ethnicities (p < 0.001). Both alleles of CA-SSR1 were significantly longer in East Asians than in individuals of other ethnicities (p < 0.001). Expression studies using bronchial epithelial cultures demonstrated a trend towards increased mRNA expression in cultures having the variant SNP −216 G/T or T/T genotypes. Monoallelic amplification of the CA-SSR1 locus was present in 30.6% of the informative cases and occurred more often in individuals of East Asian ethnicity. AI was present in 44.4% (95% confidence interval: 34.1%–54.7%) of mutant tumors compared with 25.9% (20.6%–31.2%) of wild-type tumors (p = 0.002). The shorter allele in tumors with AI in East Asian individuals was selectively amplified (shorter allele dominant) more often in mutant tumors (75.0%, 61.6%–88.4%) than in wild-type tumors (43.5%, 31.8%–55.2%, p = 0.003). 
In addition, there was a strong positive association between AI ratios of CA-SSR1 alleles and AI of mutant alleles.
The three polymorphisms associated with increased EGFR protein production (shorter CA-SSR1 length and variant forms of SNPs −216 and −191) were found to be rare in East Asians as compared to other ethnicities, suggesting that the cells of East Asians may make relatively less intrinsic EGFR protein. Interestingly, especially in tumors from patients of East Asian ethnicity, EGFR mutations were found to favor the shorter allele of CA-SSR1, and selective amplification of the shorter allele of CA-SSR1 occurred frequently in tumors harboring a mutation. These distinct molecular events targeting the same allele would both be predicted to result in greater EGFR protein production and/or activity. Our findings may help to explain some of the ethnic differences observed in mutational frequencies and responses to TK inhibitors.
Masaharu Nomura and colleagues examine the distribution of EGFR polymorphisms in different populations and find differences that might explain different responses to tyrosine kinase inhibitors in lung cancer patients.
Editors' Summary
Most cases of lung cancer—the leading cause of cancer deaths worldwide—are “non-small cell lung cancer” (NSCLC), which has a very low cure rate. Recently, however, “targeted” therapies have brought new hope to patients with NSCLC. Like all cancers, NSCLC occurs when cells begin to divide uncontrollably because of changes (mutations) in their genetic material. Chemotherapy drugs treat cancer by killing these rapidly dividing cells, but, because some normal tissues are sensitive to these agents, it is hard to kill the cancer completely without causing serious side effects. Targeted therapies specifically attack the changes in cancer cells that allow them to divide uncontrollably, so it might be possible to kill the cancer cells selectively without damaging normal tissues. Epidermal growth factor receptor (EGFR) was one of the first molecules for which a targeted therapy was developed. In normal cells, messenger proteins bind to EGFR and activate its “tyrosine kinase,” an enzyme that sticks phosphate groups on tyrosine (an amino acid) in other proteins. These proteins then tell the cell to divide. Alterations to this signaling system drive the uncontrolled growth of some cancers, including NSCLC.
Why Was This Study Done?
Molecules that inhibit the tyrosine kinase activity of EGFR (for example, gefitinib) dramatically shrink some NSCLCs, particularly those in East Asian patients. Tumors shrunk by tyrosine kinase inhibitors (TKIs) often (but not always) have mutations in EGFR's tyrosine kinase. However, not all tumors with these mutations respond to TKIs, and other genetic changes—for example, amplification (multiple copies) of the EGFR gene—also affect tumor responses to TKIs. It would be useful to know which genetic changes predict these responses when planning treatments for NSCLC and to understand why the frequency of these changes varies between ethnic groups. In this study, the researchers have examined three polymorphisms—differences in DNA sequences that occur between individuals—in the EGFR gene in people with and without NSCLC. In addition, they have looked for associations between these polymorphisms, which are present in every cell of the body, and the EGFR gene mutations and allelic imbalances (genes occur in pairs but amplification or loss of one copy, or allele, often causes allelic imbalance in tumors) that occur in NSCLCs.
What Did the Researchers Do and Find?
The researchers measured how often three EGFR polymorphisms (the length of a repeat sequence called CA-SSR1, and two single nucleotide variations [SNPs])—all of which probably affect how much protein is made from the EGFR gene—occurred in normal tissue and NSCLC tissue from East Asians and individuals of European descent. They also looked for mutations in the EGFR tyrosine kinase and allelic imbalance in the tumors, and then determined which genetic variations and alterations tended to occur together in people with the same ethnicity. Among many associations, the researchers found that shorter alleles of CA-SSR1 and the minor forms of the two SNPs occurred less often in East Asians than in individuals of European descent. They also confirmed that EGFR kinase mutations were more common in NSCLCs in East Asians than in European-descent individuals. Furthermore, mutations occurred more often in tumors with allelic imbalance, and in tumors where there was allelic imbalance and an EGFR mutation, the mutant allele was amplified more often than the wild-type allele.
What Do These Findings Mean?
The researchers use these associations between gene variants and tumor-associated alterations to propose a model to explain the ethnic differences in mutational frequencies and responses to TKIs seen in NSCLC. They suggest that because of the polymorphisms in the EGFR gene commonly seen in East Asians, people from this ethnic group make less EGFR protein than people from other ethnic groups. This would explain why, if a threshold level of EGFR is needed to drive cells towards malignancy, East Asians have a high frequency of amplified EGFR tyrosine kinase mutations in their tumors—mutation followed by amplification would be needed to activate EGFR signaling. This model, though speculative, helps to explain some clinical findings, such as the frequency of EGFR mutations and of TKI sensitivity in NSCLCs in East Asians. Further studies of this type in different ethnic groups and in different tumors, as well as with other genes for which targeted therapies are available, should help oncologists provide personalized cancer therapies for their patients.
Additional Information.
Please access these Web sites via the online version of this summary at
US National Cancer Institute information on lung cancer and on cancer treatment for patients and professionals
MedlinePlus encyclopedia entries on NSCLC
Cancer Research UK information for patients about all aspects of lung cancer, including treatment with TKIs
Wikipedia pages on lung cancer, EGFR, and gefitinib (note that Wikipedia is a free online encyclopedia that anyone can edit)
PMCID: PMC1876407  PMID: 17455987
8.  Genome-wide SNP identification in multiple morphotypes of allohexaploid tall fescue (Festuca arundinacea Schreb) 
BMC Genomics  2012;13:219.
Single nucleotide polymorphisms (SNPs) provide essential tools for the advancement of research in plant genomics, and the development of SNP resources for many species has been accelerated by the capabilities of second-generation sequencing technologies. The current study aimed to develop and use a novel bioinformatic pipeline to generate a comprehensive collection of SNP markers within the agriculturally important pasture grass tall fescue; an outbreeding allopolyploid species displaying three distinct morphotypes: Continental, Mediterranean and rhizomatous.
A bioinformatic pipeline was developed that successfully identified SNPs within genotypes from distinct tall fescue morphotypes, following the sequencing of 414 polymerase chain reaction (PCR) – generated amplicons using 454 GS FLX technology. Equivalent amplicon sets were derived from representative genotypes of each morphotype, including six Continental, five Mediterranean and one rhizomatous. A total of 8,584 and 2,292 SNPs were identified with high confidence within the Continental and Mediterranean morphotypes respectively. The success of the bioinformatic approach was demonstrated through validation (at a rate of 70%) of a subset of 141 SNPs using both SNaPshot™ and GoldenGate™ assay chemistries. Furthermore, the quantitative genotyping capability of the GoldenGate™ assay revealed that approximately 30% of the putative SNPs were accessible to co-dominant scoring, despite the hexaploid genome structure. The sub-genome-specific origin of each SNP validated from Continental tall fescue was predicted using a phylogenetic approach based on comparison with orthologous sequences from predicted progenitor species.
Using the appropriate bioinformatic approach, amplicon resequencing based on 454 GS FLX technology is an effective method for the identification of polymorphic SNPs within the genomes of Continental and Mediterranean tall fescue. The GoldenGate™ assay is capable of high-throughput co-dominant SNP allele detection, and minimises the problems associated with SNP genotyping in a polyploid by effectively reducing the complexity to a diploid system. This SNP collection may now be refined and used in applications such as cultivar identification, genetic linkage map construction, genome-wide association studies and genomic selection in tall fescue. The bioinformatic pipeline described here represents an effective general method for SNP discovery within outbreeding allopolyploid species.
PMCID: PMC3444928  PMID: 22672128
Lolium arundinaceum; Molecular marker; DNA sequencing; Haplotype; Sub-genome
9.  SNP calling by sequencing pooled samples 
BMC Bioinformatics  2012;13:239.
Performing high throughput sequencing on samples pooled from different individuals is a strategy to characterize genetic variability at a small fraction of the cost required for individual sequencing. In certain circumstances some variability estimators have even lower variance than those obtained with individual sequencing. SNP calling and estimating the frequency of the minor allele from pooled samples, though, is a subtle exercise for at least three reasons. First, sequencing errors may have a much larger relevance than in individual SNP calling: while their impact in individual sequencing can be reduced by setting a restriction on a minimum number of reads per allele, this would have a strong and undesired effect in pools because it is unlikely that alleles at low frequency in the pool will be read many times. Second, the prior allele frequency for heterozygous sites in individuals is usually 0.5 (assuming one is not analyzing sequences coming from, e.g. cancer tissues), but this is not true in pools: in fact, under the standard neutral model, singletons (i.e. alleles of minimum frequency) are the most common class of variants because P(f) ∝ 1/f and they occur more often as the sample size increases. Third, an allele appearing only once in the reads from a pool does not necessarily correspond to a singleton in the set of individuals making up the pool, and vice versa, there can be more than one read – or, more likely, none – from a true singleton.
To improve upon existing theory and software packages, we have developed a Bayesian approach for minor allele frequency (MAF) computation and SNP calling in pools (and implemented it in a program called snape): the approach takes into account sequencing errors and allows users to choose different priors. We also set up a pipeline which can simulate the coalescence process giving rise to the SNPs, the pooling procedure and the sequencing. We used it to compare the performance of snape to that of other packages.
We present software that helps call SNPs in pooled samples: it has good power while retaining a low false discovery rate (FDR). The method also provides the posterior probability that a SNP is segregating and the full posterior distribution of f for every SNP. To test the behaviour of our software, we generated (through simulated coalescence) artificial genomes and computed the effect of a pooled sequencing protocol, followed by SNP calling. In this setting, snape has better power and FDR than the comparable packages samtools, PoPoolation and Varscan: for N = 50 chromosomes, snape has power ≈ 35% and FDR ≈ 2.5%. snape is available at (source code and precompiled binaries).
PMCID: PMC3475117  PMID: 22992255
10.  The Diploid Genome Sequence of an Individual Human 
PLoS Biology  2007;5(10):e254.
Presented here is a genome sequence of an individual human. It was produced from ∼32 million random DNA fragments, sequenced by Sanger dideoxy technology and assembled into 4,528 scaffolds, comprising 2,810 million bases (Mb) of contiguous sequence with approximately 7.5-fold coverage for any given region. We developed a modified version of the Celera assembler to facilitate the identification and comparison of alternate alleles within this individual diploid genome. Comparison of this genome and the National Center for Biotechnology Information human reference assembly revealed more than 4.1 million DNA variants, encompassing 12.3 Mb. These variants (of which 1,288,319 were novel) included 3,213,401 single nucleotide polymorphisms (SNPs), 53,823 block substitutions (2–206 bp), 292,102 heterozygous insertion/deletion events (indels) (1–571 bp), 559,473 homozygous indels (1–82,711 bp), 90 inversions, as well as numerous segmental duplications and copy number variation regions. Non-SNP DNA variation accounts for 22% of all events identified in the donor, but these events involve 74% of all variant bases. This suggests an important role for non-SNP genetic alterations in defining the diploid genome structure. Moreover, 44% of genes were heterozygous for one or more variants. Using a novel haplotype assembly strategy, we were able to span 1.5 Gb of genome sequence in segments >200 kb, providing further precision to the diploid nature of the genome. These data depict a definitive molecular portrait of a diploid human genome that provides a starting point for future genome comparisons and enables an era of individualized genomic information.
Author Summary
We have generated an independently assembled diploid human genomic DNA sequence from both chromosomes of a single individual (J. Craig Venter). Our approach, based on whole-genome shotgun sequencing and using enhanced genome assembly strategies and software, generated an assembled genome over half of which is represented in large diploid segments (>200 kilobases), enabling study of the diploid genome. Comparison with previous reference human genome sequences, which were composites comprising multiple humans, revealed that the majority of genomic alterations are the well-studied class of variants based on single nucleotides (SNPs). However, the results also reveal that lesser-studied genomic variants, insertions and deletions, while comprising a minority (22%) of genomic variation events, actually account for almost 74% of variant nucleotides. Inclusion of insertion and deletion genetic variation into our estimates of interchromosomal difference reveals that only 99.5% similarity exists between the two chromosomal copies of an individual and that genetic variation between two individuals is as much as five times higher than previously estimated. The existence of a well-characterized diploid human genome sequence provides a starting point for future individual genome comparisons and enables the emerging era of individualized genomic information.
Comparison of the DNA sequence of an individual human with the reference sequence reveals a surprising amount of difference.
PMCID: PMC1964779  PMID: 17803354
11.  Validation of SNP Allele Frequencies Determined by Pooled Next-Generation Sequencing in Natural Populations of a Non-Model Plant Species 
PLoS ONE  2013;8(11):e80422.
Sequencing of pooled samples (Pool-Seq) using next-generation sequencing technologies has become increasingly popular, because it represents a rapid and cost-effective method to determine allele frequencies for single nucleotide polymorphisms (SNPs) in population pools. Validation of allele frequencies determined by Pool-Seq has been attempted using an individual genotyping approach, but these studies tend to use samples from existing model organism databases or DNA stores, and do not validate a realistic setup for sampling natural populations. Here we used pyrosequencing to validate allele frequencies determined by Pool-Seq in three natural populations of Arabidopsis halleri (Brassicaceae). The allele frequency estimates of the pooled population samples (consisting of 20 individual plant DNA samples) were determined after mapping Illumina reads to (i) the publicly available, high-quality reference genome of a closely related species (Arabidopsis thaliana) and (ii) our own de novo draft genome assembly of A. halleri. We then pyrosequenced nine selected SNPs using the same individuals from each population, resulting in a total of 540 samples. Our results show a highly significant and accurate relationship between pooled and individually determined allele frequencies, irrespective of the reference genome used. Allele frequencies differed on average by less than 4%. There was no tendency that either the Pool-Seq or the individual-based approach resulted in higher or lower estimates of allele frequencies. Moreover, the rather high coverage in the mapping to the two reference genomes, ranging from 55 to 284x, had no significant effect on the accuracy of the Pool-Seq. A resampling analysis showed that only very low coverage values (below 10-20x) would substantially reduce the precision of the method. We therefore conclude that a pooled re-sequencing approach is well suited for analyses of genetic variation in natural populations.
PMCID: PMC3820589  PMID: 24244686
12.  Switchgrass Genomic Diversity, Ploidy, and Evolution: Novel Insights from a Network-Based SNP Discovery Protocol 
PLoS Genetics  2013;9(1):e1003215.
Switchgrass (Panicum virgatum L.) is a perennial grass that has been designated as an herbaceous model biofuel crop for the United States of America. To facilitate accelerated breeding programs of switchgrass, we developed both an association panel and linkage populations for genome-wide association study (GWAS) and genomic selection (GS). All of the 840 individuals were then genotyped using genotyping by sequencing (GBS), generating 350 GB of sequence in total. As a highly heterozygous polyploid (tetraploid and octoploid) species lacking a reference genome, switchgrass is highly intractable with earlier methodologies of single nucleotide polymorphism (SNP) discovery. To access the genetic diversity of species like switchgrass, we developed a SNP discovery pipeline based on a network approach called the Universal Network-Enabled Analysis Kit (UNEAK). Complexities that hinder single nucleotide polymorphism discovery, such as repeats, paralogs, and sequencing errors, are easily resolved with UNEAK. Here, 1.2 million putative SNPs were discovered in a diverse collection of primarily upland, northern-adapted switchgrass populations. Further analysis of this data set revealed the fundamentally diploid nature of tetraploid switchgrass. Taking advantage of the high conservation of genome structure between switchgrass and foxtail millet (Setaria italica (L.) P. Beauv.), two parent-specific, synteny-based, ultra high-density linkage maps containing a total of 88,217 SNPs were constructed. Also, our results showed clear patterns of isolation-by-distance and isolation-by-ploidy in natural populations of switchgrass. Phylogenetic analysis supported a general south-to-north migration path of switchgrass. In addition, this analysis suggested that upland tetraploid arose from upland octoploid. 
All together, this study provides unparalleled insights into the diversity, genomic complexity, population structure, phylogeny, phylogeography, ploidy, and evolutionary dynamics of switchgrass.
Author Summary
Recent advances in sequencing technologies have enabled large-scale surveys of genetic diversity in model species with a wholly or partly sequenced reference genome. However, thousands of key species, which are essential for food, health, energy, and ecology, do not have reference genomes. To accelerate their breeding cycle via marker assisted selection, high-throughput genotyping is required for these valuable species, in spite of the absence of reference genomes. Based on genotyping by sequencing (GBS) technology, we developed a new single nucleotide polymorphism (SNP) discovery protocol, the Universal Network-Enabled Analysis Kit (UNEAK), which can be widely used in any species, regardless of genome complexity or the availability of a reference genome. Here we test this protocol on switchgrass, currently the prime energy crop species in the United States of America. In addition to the discovery of over a million SNPs and construction of high-density linkage maps, we provide novel insights into the genome complexity, ploidy, phylogeny, and evolution of switchgrass. This is only the beginning: we believe UNEAK offers the key to the exploration and exploitation of the genetic diversity of thousands of non-model species.
PMCID: PMC3547862  PMID: 23349638
13.  ARG-based genome-wide analysis of cacao cultivars 
BMC Bioinformatics  2012;13(Suppl 19):S17.
An ancestral recombination graph (ARG) is a topological structure that captures the relationship between the extant genomic sequences in terms of genetic events including recombinations. IRiS is a system that estimates the ARG on sequences of individuals, at genomic scales, capturing the relationship between these individuals of the species. Recently, this system was used to estimate the ARG of the recombining X Chromosome of a collection of human populations using relatively dense, bi-allelic SNP data.
While the ARG is a natural model for capturing the inter-relationship between a single chromosome of the individuals of a species, it is not immediately apparent how the model can utilize whole-genome (across chromosomes) diploid data. Also, the sheer complexity of an ARG structure presents a challenge to graph visualization techniques. In this paper we examine the ARG reconstruction for (1) genome-wide or multiple chromosomes, (2) multi-allelic and (3) extremely sparse data. To aid in the visualization of the results of the reconstructed ARG, we additionally construct a much simplified topology, a classification tree, suggested by the ARG.
As the test case, we study the problem of extracting the relationship between populations of Theobroma cacao. The chocolate tree is an outcrossing species in the wild, due to self-incompatibility mechanisms at play. Thus a principled approach to understanding the inter-relationships between the different populations must take the shuffling of the genomic segments into account. The polymorphisms in the test data are short tandem repeats (STR) and are multi-allelic (sometimes as high as 30 distinct possible values at a locus). Each is at a genomic location that is bilaterally transmitted, hence the ARG is a natural model for this data. Another characteristic of this plant data set is that while it is genome-wide, across 10 linkage groups or chromosomes, it is very sparse, i.e., only 96 loci from a genome of approximately 400 megabases. The results are visualized both as MDS plots and as classification trees. To evaluate the accuracy of the ARG approach, we compare the results with those available in literature.
We have extended the ARG model to incorporate genome-wide (ensemble of multiple chromosomes) data in a natural way. We present a simple scheme to implement this in practice. Finally, this is the first time that a plant population data set is being studied by estimating its underlying ARG. We demonstrate an overall precision of 0.92 and an overall recall of 0.93 of the ARG-based classification, with respect to the gold standard. While we have corroborated the classification of the samples with that in literature, this opens the door to other potential studies that can be made on the ARG.
PMCID: PMC3526434  PMID: 23281769
14.  Assignment of SNP allelic configuration in polyploids using competitive allele-specific PCR: application to citrus triploid progeny 
Annals of Botany  2013;111(4):731-742.
Polyploidy is a major component of eukaryote evolution. Estimation of allele copy numbers for molecular markers has long been considered a challenge for polyploid species, while this process is essential for most genetic research. With the increasing availability and whole-genome coverage of single nucleotide polymorphism (SNP) markers, it is essential to implement a versatile SNP genotyping method to assign allelic configuration efficiently in polyploids.
This work evaluates the usefulness of the KASPar method, based on competitive allele-specific PCR, for the assignment of SNP allelic configuration. Citrus was chosen as a model because of its economic importance, the ongoing worldwide polyploidy manipulation projects for cultivar and rootstock breeding, and the increasing availability of SNP markers.
Fifteen SNP markers were successfully designed that produced clear allele signals that were in agreement with previous genotyping results at the diploid level. The analysis of DNA mixes between two haploid lines (Clementine and pummelo) at 13 different ratios revealed a very high correlation (average = 0·9796; s.d. = 0·0094) between the allele ratio and two parameters [θ angle = tan⁻¹(y/x) and y′ = y/(x + y)] derived from the two normalized allele signals (x and y) provided by KASPar. Separated cluster analysis and analysis of variance (ANOVA) from mixed DNA simulating triploid and tetraploid hybrids provided 99·71 % correct allelic configuration. Moreover, triploid populations arising from 2n gametes and interploid crosses were easily genotyped and provided useful genetic information. This work demonstrates that the KASPar SNP genotyping technique is an efficient way to assign heterozygous allelic configurations within polyploid populations. This method is accurate, simple and cost-effective. Moreover, it may be useful for quantitative studies, such as relative allele-specific expression analysis and bulk segregant analysis.
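The two signal-derived parameters named in this abstract can be computed directly from a pair of normalized allele signals. The short Python sketch below is only an illustration: the function name `kaspar_parameters` is hypothetical, and `math.atan2` is used in place of a plain arctangent of y/x so that a zero x signal does not cause a division error (for positive x the two give the same angle).

```python
import math


def kaspar_parameters(x, y):
    """Return (theta, y_prime) for normalized allele signals x and y.

    theta   = arctan(y / x), in radians
    y_prime = y / (x + y)
    """
    if x < 0 or y < 0 or (x + y) == 0:
        raise ValueError("signals must be non-negative and not both zero")
    theta = math.atan2(y, x)   # same as atan(y / x) when x > 0
    y_prime = y / (x + y)
    return theta, y_prime


# Equal signals from both alleles give theta = pi/4 and y' = 0.5.
theta, y_prime = kaspar_parameters(1.0, 1.0)
```

Both parameters increase monotonically with the share of the y allele, which is why either can be correlated against a known allele ratio in a DNA mix.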
PMCID: PMC3605964  PMID: 23422023
SNP genotyping; competitive allele-specific PCR; KASPar; allele dosage; Citrus clementina; C. maxima; C. reticulata; polyploid; triploid; tetraploid
15.  Ancient Evolutionary Trade-Offs between Yeast Ploidy States 
PLoS Genetics  2013;9(3):e1003388.
The number of chromosome sets contained within the nucleus of eukaryotic organisms is a fundamental yet evolutionarily poorly characterized genetic variable of life. Here, we mapped the impact of ploidy on the mitotic fitness of baker's yeast and its never domesticated relative Saccharomyces paradoxus across wide swaths of their natural genotypic and phenotypic space. Surprisingly, environment-specific influences of ploidy on reproduction were found to be the rule rather than the exception. These ploidy–environment interactions were well conserved across the 2 billion generations separating the two species, suggesting that they are the products of strong selection. Previous hypotheses of generalizable advantages of haploidy or diploidy in ecological contexts imposing nutrient restriction, toxin exposure, and elevated mutational loads were rejected in favor of more fine-grained models of the interplay between ecology and ploidy. On a molecular level, cell size and mating type locus composition had equal, but limited, explanatory power, each explaining 12.5%–17% of ploidy–environment interactions. The mechanism of the cell size–based superior reproductive efficiency of haploids during Li+ exposure was traced to the Li+ exporter ENA. Removal of the Ena transporters, forcing dependence on the Nha1 extrusion system, completely altered the effects of ploidy on Li+ tolerance and evoked a strong diploid superiority, demonstrating how genetic variation at a single locus can completely reverse the relative merits of haploidy and diploidy. Taken together, our findings unmasked a dynamic interplay between ploidy and ecology that was of unpredicted evolutionary importance and had multiple molecular roots.
Author Summary
Organisms vary in the number of chromosome sets contained within the nucleus of each cell, but neither the reasons nor the consequences of this variation are well understood. We designed yeasts that differed in the number of chromosome sets but were otherwise identical and mapped the consequences of such ploidy variations during exposure to a large palette of environments. Contrary to commonly held assumptions, we found ploidy effects on the mitotic reproductive capacity of yeast to be the rule rather than the exception and to be highly evolutionarily conserved. Furthermore, our data rejected previously contemplated hypotheses of generalizable advantages of haploidy or diploidy when cells face nutrient starvation or are exposed to toxins or increased mutation rates. We also mapped the molecular processes mediating ploidy–environment interactions, showing that cell size and mating type locus composition had equal explanatory power. Finally we show that ploidy effects can be mechanistically very subtle, as a designed shift from one plasma membrane Li+ transporter to another completely altered the relative merits of having one or two chromosome sets when exposed to high Li+ concentrations. This complex and dynamic interplay between the number of chromosomes sets and the fluctuating environment must be taken into account when considering organismal form and behavior.
PMCID: PMC3605057  PMID: 23555297
16.  Detecting imbalanced expression of SNP alleles by minisequencing on microarrays 
BMC Biotechnology  2004;4:24.
Each of the human genes or transcriptional units is likely to contain single nucleotide polymorphisms that may give rise to sequence variation between individuals and tissues on the level of RNA. Based on recent studies, differential expression of the two alleles of heterozygous coding single nucleotide polymorphisms (SNPs) may be frequent for human genes. Methods with high accuracy to be used in a high throughput setting are needed for systematic surveys of expressed sequence variation. In this study we evaluated two formats of multiplexed, microarray based minisequencing for quantitative detection of imbalanced expression of SNP alleles. We used a panel of ten SNPs located in five genes known to be expressed in two endothelial cell lines as our model system.
The accuracy and sensitivity of quantitative detection of allelic imbalance was assessed for each SNP by constructing regression lines using a dilution series of mixed samples from individuals of different genotype. Accurate quantification of SNP alleles by both assay formats was evidenced by R² values > 0.95 for the majority of the regression lines. According to a two-sample t-test, we were able to distinguish 1–9% of a minority SNP allele from a homozygous genotype, with larger variation between SNPs than between assay formats. Six of the SNPs, heterozygous in either of the two cell lines, were genotyped in RNA extracted from the endothelial cells. The coefficient of variation between the fluorescent signals from five parallel reactions was similar for cDNA and genomic DNA. The fluorescence signal intensity ratios measured in the cDNA samples were compared to those in genomic DNA to determine the relative expression levels of the two alleles of each SNP. Four of the six SNPs tested displayed a higher than 1.4-fold difference in allelic ratios between cDNA and genomic DNA. The results were verified by allele-specific oligonucleotide hybridisation and minisequencing in a microtiter plate format.
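The comparison of allelic signal ratios between cDNA and genomic DNA can be sketched in a few lines of Python. This is a simplified illustration, not the authors' analysis: the function name `allelic_fold_difference` and its argument names are hypothetical, and the inputs are assumed to be already background-corrected signal intensities for the two alleles.

```python
def allelic_fold_difference(cdna_a, cdna_b, gdna_a, gdna_b):
    """Fold difference between the allele A/B signal ratio in cDNA and in genomic DNA.

    The genomic DNA ratio of a heterozygote serves as the balanced (1:1) reference.
    The result is always >= 1, regardless of which allele is over-represented.
    """
    cdna_ratio = cdna_a / cdna_b
    gdna_ratio = gdna_a / gdna_b
    fold = cdna_ratio / gdna_ratio
    return max(fold, 1.0 / fold)


# Allele A gives three times the signal of allele B in cDNA but equal signal in genomic DNA.
fold = allelic_fold_difference(300.0, 100.0, 100.0, 100.0)
imbalanced = fold > 1.4   # threshold quoted in the abstract
```

Normalizing by the genomic DNA ratio removes assay-specific differences in how strongly each allele is detected, so that only expression differences remain.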
We conclude that microarray based minisequencing is an accurate and accessible tool for multiplexed screening for imbalanced allelic expression in multiple samples and tissues in parallel.
PMCID: PMC529269  PMID: 15500681
17.  HaploSNPer: a web-based allele and SNP detection tool 
BMC Genetics  2008;9:23.
Single nucleotide polymorphisms (SNPs) and small insertions or deletions (indels) are the most common type of polymorphisms and are frequently used for molecular marker development. Such markers have become very popular for all kinds of genetic analysis, including haplotype reconstruction. Haplotypes can be reconstructed for whole chromosomes but also for specific genes, based on the SNPs present. Haplotypes in the latter context represent the different alleles of a gene. The computational approach to SNP mining is becoming increasingly popular because of the continuously increasing number of sequences deposited in databases, which allows a more accurate identification of SNPs. Several software packages have been developed for SNP mining from databases. Of these, QualitySNP is the only tool that combines SNP detection with the reconstruction of alleles, which results in a lower number of false positive SNPs and also works much faster than other programs. We have built a web-based SNP discovery and allele detection tool (HaploSNPer) based on QualitySNP.
HaploSNPer is a flexible web-based tool for detecting SNPs and alleles in user-specified input sequences from both diploid and polyploid species. It includes BLAST for finding homologous sequences in public EST databases, CAP3 or PHRAP for aligning them, and QualitySNP for discovering reliable allelic sequences and SNPs. All possible and reliable alleles are detected by a mathematical algorithm using potential SNP information. Reliable SNPs are then identified based on the reconstructed alleles and on sequence redundancy.
Thorough testing of HaploSNPer (and the underlying QualitySNP algorithm) has shown that EST information alone is sufficient for the identification of alleles and that reliable SNPs can be found efficiently. Furthermore, HaploSNPer supplies a user friendly interface for visualization of SNP and alleles. HaploSNPer is available from .
PMCID: PMC2288614  PMID: 18307806
18.  Identifying the genetic determinants of transcription factor activity 
Genome-wide messenger RNA expression levels are highly heritable. However, the molecular mechanisms underlying this heritability are poorly understood. The influence of trans-acting polymorphisms is often mediated by changes in the regulatory activity of one or more sequence-specific transcription factors (TFs). We use a method that exploits prior information about the DNA-binding specificity of each TF to estimate its genotype-specific regulatory activity. To this end, we perform linear regression of genotype-specific differential mRNA expression on TF-specific promoter-binding affinity. Treating inferred TF activity as a quantitative trait and mapping it across a panel of segregants from an experimental genetic cross allows us to identify trans-acting loci (‘aQTLs') whose allelic variation modulates the TF. A few of these aQTL regions contain the gene encoding the TF itself; several others contain a gene whose protein product is known to interact with the TF. Our method is strictly causal, as it only uses sequence-based features as predictors. Application to budding yeast demonstrates a dramatic increase in statistical power, compared with existing methods, to detect locus-TF associations and trans-acting loci. Our aQTL mapping strategy also succeeds in mouse.
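The regression step described above can be illustrated with a minimal Python sketch. It assumes a single TF and ordinary least squares with one predictor, which is a simplification; the function name `tf_activity` and the toy numbers are hypothetical and do not come from the study.

```python
def tf_activity(affinities, diff_expression):
    """Least-squares slope of differential expression on promoter affinity.

    affinities      : predicted promoter-binding affinity of one TF, one value per gene
    diff_expression : differential mRNA expression of the same genes in one segregant
    The slope is taken as that segregant's inferred activity for the TF.
    """
    n = len(affinities)
    if n != len(diff_expression) or n < 2:
        raise ValueError("need two equal-length sequences with at least two genes")
    mean_a = sum(affinities) / n
    mean_e = sum(diff_expression) / n
    sxx = sum((a - mean_a) ** 2 for a in affinities)
    sxy = sum((a - mean_a) * (e - mean_e)
              for a, e in zip(affinities, diff_expression))
    return sxy / sxx


# One activity value per segregant turns TF activity into a quantitative trait.
affinities = [0.0, 1.0, 2.0, 3.0]
segregant_profiles = {
    "seg1": [0.5, 2.5, 4.5, 6.5],      # expression rises with affinity
    "seg2": [1.0, 0.0, -1.0, -2.0],    # expression falls with affinity
}
activities = {name: tf_activity(affinities, profile)
              for name, profile in segregant_profiles.items()}
```

The resulting per-segregant activities could then be tested against genotype at each marker, in the same way an expression level would be in an eQTL analysis.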
Genetic sequence variation naturally perturbs mRNA expression levels in the cell. In recent years, analysis of parallel genotyping and expression profiling data for segregants from genetic crosses between parental strains has revealed that mRNA expression levels are highly heritable. Expression quantitative trait loci (eQTLs), whose allelic variation regulates the expression level of individual genes, have successfully been identified (Brem et al, 2002; Schadt et al, 2003). The molecular mechanisms underlying the heritability of mRNA expression are poorly understood. However, they are likely to involve mediation by transcription factors (TFs). We present a new transcription-factor-centric method that greatly increases our ability to understand what drives the genetic variation in mRNA expression (Figure 1). Our method identifies genomic loci (‘aQTLs') whose allelic variation modulates the protein-level activity of specific TFs. To map aQTLs, we integrate genotyping and expression profiling data with quantitative prior information about DNA-binding specificity of transcription factors in the form of position-specific affinity matrices (Bussemaker et al, 2007). We applied our method in two different organisms: budding yeast and mouse.
In our approach, the inferred TF activity is explicitly treated as a quantitative trait, and genetically mapped. The decrease of ‘phenotype space' from that of all genes (in the eQTL approach) to that of all TFs (in our aQTL approach) increases the statistical power to detect trans-acting loci in two distinct ways. First, as each inferred TF activity is derived from a large number of genes, it is far less noisy than mRNA levels of individual genes. Second, the number of trait/marker combinations that needs to be tested for statistical significance in parallel is roughly two orders of magnitude smaller than for eQTLs. We identified a total of 103 locus-TF associations, a more than six-fold improvement over the 17 locus-TF associations identified by several existing methods (Brem et al, 2002; Yvert et al, 2003; Lee et al, 2006; Smith and Kruglyak, 2008; Zhu et al, 2008). The total number of distinct genomic loci identified as an aQTL equals 31, which includes 11 of the 13 previously identified eQTL hotspots (Smith and Kruglyak, 2008).
To better understand the mechanisms underlying the identified genetic linkages, we examined the genes within each aQTL region. First, we found four ‘local' aQTLs, which encompass the gene encoding the TF itself. This includes the known polymorphism in the HAP1 gene (Brem et al, 2002), but also novel predictions of trans-acting polymorphisms in RFX1, STB5, and HAP4. Second, using high-throughput protein–protein interaction data, we identified putative causal genes for several aQTLs. For example, we predict that a polymorphism in the cyclin-dependent kinase CDC28 antagonistically modulates the functionally distinct cell cycle regulators Fkh1 and Fkh2. In this and other cases, our approach naturally accounts for post-translational modulation of TF activity at the protein level.
We validated our ability to predict locus-TF associations in yeast using gene expression profiles of allele replacement strains from a previous study (Smith and Kruglyak, 2008). Chromosome 15 contains an aQTL whose allelic status influences the activity of no fewer than 30 distinct TFs. This locus includes IRA2, which controls intracellular cAMP levels. We used the gene expression profile of IRA2 replacement strains to confirm that the polymorphism within IRA2 indeed modulates a subset of the TFs whose activity was predicted to link to this locus, and no other TFs.
Application of our approach to mouse data identified an aQTL modulating the activity of a specific TF in liver cells. We identified an aQTL on mouse chromosome 7 for Zscan4, a transcription factor containing four zinc finger domains and a SCAN domain. Even though we could not detect a candidate causal gene for Zscan4p because of lack of information about the mouse genome, our result demonstrates that our method also works in higher eukaryotes.
In summary, aQTL mapping has a greatly improved sensitivity to detect molecular mechanisms underlying the heritability of gene expression. The successful application of our approach to yeast and mouse data underscores the value of explicitly treating the inferred TF activity as a quantitative trait for increasing statistical power of detecting trans-acting loci. Furthermore, our method is computationally efficient, and easily applicable to any other organism whenever prior information about the DNA-binding specificity of TFs is available.
Analysis of parallel genotyping and expression profiling data has shown that mRNA expression levels are highly heritable. Currently, only a tiny fraction of this genetic variance can be mechanistically accounted for. The influence of trans-acting polymorphisms on gene expression traits is often mediated by transcription factors (TFs). We present a method that exploits prior knowledge about the in vitro DNA-binding specificity of a TF in order to map the loci (‘aQTLs') whose inheritance modulates its protein-level regulatory activity. Genome-wide regression of differential mRNA expression on predicted promoter affinity is used to estimate segregant-specific TF activity, which is subsequently mapped as a quantitative phenotype. In budding yeast, our method identifies six times as many locus-TF associations and more than twice as many trans-acting loci as all existing methods combined. Application to mouse data from an F2 intercross identified an aQTL on chromosome VII modulating the activity of Zscan4 in liver cells. Our method has greatly improved statistical power over existing methods, is mechanism based, strictly causal, computationally efficient, and generally applicable.
PMCID: PMC2964119  PMID: 20865005
gene expression; gene regulatory networks; genetic variation; quantitative trait loci; transcription factors
19.  Computational Analysis of Whole-Genome Differential Allelic Expression Data in Human 
PLoS Computational Biology  2010;6(7):e1000849.
Allelic imbalance (AI) is a phenomenon where the two alleles of a given gene are expressed at different levels in a given cell, either because of epigenetic inactivation of one of the two alleles, or because of genetic variation in regulatory regions. Recently, Bing et al. have described the use of genotyping arrays to assay AI at a high resolution (∼750,000 SNPs across the autosomes). In this paper, we investigate computational approaches to analyze this data and identify genomic regions with AI in an unbiased and robust statistical manner. We propose two families of approaches: (i) a statistical approach based on z-score computations, and (ii) a family of machine learning approaches based on Hidden Markov Models. Each method is evaluated using previously published experimental data sets as well as with permutation testing. When applied to whole genome data from 53 HapMap samples, our approaches reveal that allelic imbalance is widespread (most expressed genes show evidence of AI in at least one of our 53 samples) and that most AI regions in a given individual are also found in at least a few other individuals. While many AI regions identified in the genome correspond to known protein-coding transcripts, others overlap with recently discovered long non-coding RNAs. We also observe that genomic regions with AI not only include complete transcripts with consistent differential expression levels, but also more complex patterns of allelic expression such as alternative promoters and alternative 3′ end. The approaches developed not only shed light on the incidence and mechanisms of allelic expression, but will also help towards mapping the genetic causes of allelic expression and identify cases where this variation may be linked to diseases.
Author Summary
Measuring gene expression, and searching the genome for regulatory regions responsible for differences in expression levels, is one of the key paths of research used to identify disease-causing genes and to explain differences between healthy individuals. Typically, experiments have measured and compared gene expression in multiple individuals, and used this information to attempt to map regulatory regions responsible. Differences in environment between individuals can, however, cause differences in gene expression unrelated to the underlying regulatory sequence. New genotyping technologies enable the measurement of expression of both copies of a particular gene, at loci that are heterozygous within a particular individual. This will therefore act as an internal control, as environmental factors will continue to affect the expression of both copies of a gene at presumably equal levels, and differences in expression are more likely to be explicable by differences in regulatory regions specific to the two copies of the gene itself. Differences between regulatory regions are expected to lead to differences in expression of the two copies (or the two alleles) of a particular gene, also known as allelic imbalance. We describe a set of signal processing methods for the reliable detection of allelic expression within the genome.
PMCID: PMC2900287  PMID: 20628616
20.  Regional replication of association with refractive error on 15q14 and 15q25 in the Age-Related Eye Disease Study cohort 
Molecular Vision  2013;19:2173-2186.
Refractive error is a complex trait with multiple genetic and environmental risk factors, and is the most common cause of preventable blindness worldwide. The common nature of the trait suggests the presence of many genetic factors that individually may have modest effects. To achieve an adequate sample size to detect these common variants, large, international collaborations have formed. These consortia typically use meta-analysis to combine multiple studies from many different populations. This approach is robust to differences between populations; however, it does not compensate for the different haplotypes in each genetic background evidenced by different alleles in linkage disequilibrium with the causative variant. We used the Age-Related Eye Disease Study (AREDS) cohort to replicate published significant associations at two loci on chromosome 15 from two genome-wide association studies (GWASs). The single nucleotide polymorphisms (SNPs) that exhibited association on chromosome 15 in the original studies did not show evidence of association with refractive error in the AREDS cohort. This paper seeks to determine whether the non-replication in this AREDS sample may be due to the limited number of SNPs chosen for replication.
We selected all SNPs genotyped on the Illumina Omni2.5v1_B array or custom TaqMan assays or imputed from the GWAS data, in the region surrounding the SNPs from the Consortium for Refractive Error and Myopia study. We analyzed the SNPs for association with refractive error using standard regression methods in PLINK. The effective number of tests was calculated using the Genetic Type I Error Calculator.
Although use of the same SNPs used in the Consortium for Refractive Error and Myopia study did not show any evidence of association with refractive error in this AREDS sample, other SNPs within the candidate regions demonstrated an association with refractive error. Significant evidence of association was found using the hyperopia categorical trait, with the most significant SNPs rs1357179 on 15q14 (p=1.69×10−3) and rs7164400 on 15q25 (p=8.39×10−4), which passed the replication thresholds.
This study adds to the growing body of evidence that the most significant SNPs found in one population may not be significant in another population, owing to differences in linkage disequilibrium structure and/or allele frequency. This suggests that replication studies should include less significant SNPs in an associated region rather than only a few SNPs chosen by a significance threshold.
PMCID: PMC3826323  PMID: 24227913
21.  HIBAG—HLA genotype imputation with attribute bagging 
The Pharmacogenomics Journal  2013;14(2):192-200.
Genotyping of classical human leukocyte antigen (HLA) alleles is an essential tool in the analysis of diseases and adverse drug reactions with associations mapping to the major histocompatibility complex (MHC). However, deriving high-resolution HLA types subsequent to whole-genome single-nucleotide polymorphism (SNP) typing or sequencing is often cost prohibitive for large samples. An alternative approach takes advantage of the extended haplotype structure within the MHC to predict HLA alleles using dense SNP genotypes, such as those available from genome-wide SNP panels. Current methods for HLA imputation are difficult to apply or may require the user to have access to large training data sets with SNP and HLA types. We propose HIBAG, HLA Imputation using attribute BAGging, that makes predictions by averaging HLA-type posterior probabilities over an ensemble of classifiers built on bootstrap samples. We assess the performance of HIBAG using our study data (n=2668 subjects of European ancestry) as a training set and HLA data from the British 1958 birth cohort study (n≈1000 subjects) as independent validation samples. Prediction accuracies for HLA-A, B, C, DRB1 and DQB1 range from 92.2% to 98.1% using a set of SNP markers common to the Illumina 1M Duo, OmniQuad, OmniExpress, 660K and 550K platforms. HIBAG performed well compared with the other two leading methods, HLA*IMP and BEAGLE. This method is implemented in a freely available HIBAG R package that includes pre-fit classifiers for European, Asian, Hispanic and African ancestries, providing a readily available imputation approach without the need to have access to large training data sets.
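The core idea described above, averaging posterior probabilities over an ensemble of classifiers built on bootstrap samples, can be sketched as follows. This is a toy illustration and not the HIBAG implementation: the per-classifier model here is a simple smoothed lookup table over a random SNP subset, and every function name and parameter is hypothetical. Training data are assumed to be `(genotype_tuple, hla_label)` pairs.

```python
import random
from collections import Counter


def fit_classifier(samples, snp_subset):
    """Count label frequencies per genotype pattern on the chosen SNP subset."""
    table = {}
    for genotype, label in samples:
        key = tuple(genotype[i] for i in snp_subset)
        table.setdefault(key, Counter())[label] += 1
    return snp_subset, table


def posterior(classifier, genotype, labels):
    """Laplace-smoothed posterior over labels for one genotype (toy model)."""
    snp_subset, table = classifier
    counts = table.get(tuple(genotype[i] for i in snp_subset), Counter())
    total = sum(counts.values()) + len(labels)
    return {lab: (counts[lab] + 1) / total for lab in labels}


def fit_ensemble(samples, n_classifiers=25, n_snps_per_classifier=2, seed=0):
    """Build each classifier on a bootstrap sample and a random SNP subset."""
    rng = random.Random(seed)
    n_snps = len(samples[0][0])
    ensemble = []
    for _ in range(n_classifiers):
        boot = [rng.choice(samples) for _ in samples]  # sample with replacement
        subset = sorted(rng.sample(range(n_snps), n_snps_per_classifier))
        ensemble.append(fit_classifier(boot, subset))
    return ensemble


def predict(ensemble, genotype, labels):
    """Average the per-classifier posteriors over the whole ensemble."""
    averaged = {lab: 0.0 for lab in labels}
    for classifier in ensemble:
        for lab, prob in posterior(classifier, genotype, labels).items():
            averaged[lab] += prob / len(ensemble)
    return averaged
```

Averaging posteriors rather than taking a majority vote keeps a per-sample confidence value, which can be used to flag low-confidence calls.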
PMCID: PMC3772955  PMID: 23712092
22.  Development and Evaluation of a Genome-Wide 6K SNP Array for Diploid Sweet Cherry and Tetraploid Sour Cherry 
PLoS ONE  2012;7(12):e48305.
High-throughput genome scans are important tools for genetic studies and breeding applications. Here, a 6K SNP array for use with the Illumina Infinium® system was developed for diploid sweet cherry (Prunus avium) and allotetraploid sour cherry (P. cerasus). This effort was led by RosBREED, a community initiative to enable marker-assisted breeding for rosaceous crops. Next-generation sequencing in diverse breeding germplasm provided 25 billion basepairs (Gb) of cherry DNA sequence from which were identified genome-wide SNPs for sweet cherry and for the two sour cherry subgenomes derived from sweet cherry (avium subgenome) and P. fruticosa (fruticosa subgenome). Anchoring to the peach genome sequence, recently released by the International Peach Genome Initiative, predicted relative physical locations of the 1.9 million putative SNPs detected, preliminarily filtered to 368,943 SNPs. Further filtering was guided by results of a 144-SNP subset examined with the Illumina GoldenGate® assay on 160 accessions. A 6K Infinium® II array was designed with SNPs evenly spaced genetically across the sweet and sour cherry genomes. SNPs were developed for each sour cherry subgenome by using minor allele frequency in the sour cherry detection panel to enrich for subgenome-specific SNPs followed by targeting to either subgenome according to alleles observed in sweet cherry. The array was evaluated using panels of sweet (n = 269) and sour (n = 330) cherry breeding germplasm. Approximately one third of array SNPs were informative for each crop. A total of 1825 polymorphic SNPs were verified in sweet cherry, 13% of these originally developed for sour cherry. Allele dosage was resolved for 2058 polymorphic SNPs in sour cherry, one third of these being originally developed for sweet cherry. 
This publicly available genomics resource represents a significant advance in cherry genome-scanning capability that will accelerate marker-locus-trait association discovery, genome structure investigation, and genetic diversity assessment in this diploid-tetraploid crop group.
PMCID: PMC3527432  PMID: 23284615
23.  Haplotype Mapping of a Diploid Non-Meiotic Organism Using Existing and Induced Aneuploidies 
PLoS Genetics  2008;4(1):e1.
Haplotype maps (HapMaps) reveal underlying sequence variation and facilitate the study of recombination and genetic diversity. In general, HapMaps are produced by analysis of Single-Nucleotide Polymorphism (SNP) segregation in large numbers of meiotic progeny. Candida albicans, the most common human fungal pathogen, is an obligate diploid that does not appear to undergo meiosis. Thus, standard methods for haplotype mapping cannot be used. We exploited naturally occurring aneuploid strains to determine the haplotypes of the eight chromosome pairs in the C. albicans laboratory strain SC5314 and in a clinical isolate. Comparison of the maps revealed that the clinical strain had undergone a significant amount of genome rearrangement, consisting primarily of crossover or gene conversion recombination events. SNP map haplotyping revealed that insertion and activation of the UAU1 cassette in essential and non-essential genes can result in whole chromosome aneuploidy. UAU1 is often used to construct homozygous deletions of targeted genes in C. albicans; the exact mechanism (trisomy followed by chromosome loss versus gene conversion) has not been determined. UAU1 insertion into the essential ORC1 gene resulted in a large proportion of trisomic strains, while gene conversion events predominated when UAU1 was inserted into the non-essential LRO1 gene. Therefore, induced aneuploidies can be used to generate HapMaps, which are essential for analyzing genome alterations and mitotic recombination events in this clonal organism.
Author Summary
Candida albicans, a heterozygous diploid yeast, is the most prevalent fungal pathogen. It often acquires resistance to anti-fungal drugs via genome-altering recombination events. In many organisms, recombination events are analyzed using Haplotype Maps (HapMaps), which show the location of different alleles on each chromosomal homolog. Conventional HapMaps are constructed by following allelic markers as they segregate in meiotic progeny. Because C. albicans has not been shown to undergo meiosis, construction of a Candida HapMap has not been possible. We exploited the presence of whole chromosome aneuploidies in mitotic progeny of C. albicans to detect skewed ratios of different alleles, thereby determining the relationships between these alleles on each chromosomal homolog. This facilitated the construction of a HapMap for the most commonly used C. albicans laboratory strain. We then used this HapMap to identify all of the recombination events in a clinical isolate relative to the laboratory reference strain. Finally, we used this mapping approach to investigate the molecular mechanisms that affect the C. albicans genome when it is subjected to a common gene disruption technique. Our rapid HapMap construction method is generally applicable to any organism for which whole-chromosome aneuploidy events can be identified.
PMCID: PMC2174976  PMID: 18179283
24.  Divergence of nucleosome positioning between two closely related yeast species: genetic basis and functional consequences 
Inter-species hybrids can be used to dissect the relative contribution of cis and trans effects to the evolution of nucleosome positioning. Most (∼70%) differences in nucleosome positioning between two closely related yeast species are due to cis effects. Cis effects are primarily due to divergence of AT-rich nucleosome-disfavoring sequences, but are not associated with divergence of nucleosome-favoring sequences. Differences in nucleosome positioning propagate to multiple adjacent nucleosomes, supporting the statistical positioning hypothesis. Divergence of nucleosome positioning is excluded from regulatory elements and is not correlated with gene expression divergence, suggesting a neutral mode of evolution.
Phenotypic diversity is often due to changes in gene regulation, and recent studies have characterized extensive differences between the gene expression programs of closely related species (Khaitovich et al, 2006; Tirosh et al, 2009). However, very little is known about the mechanisms that drive this divergence. Here, we analyze the evolution of nucleosome positioning, by comparing the patterns of nucleosomes between two yeast species, as well as generating the allele-specific nucleosome profile in their hybrid. We ask two main questions: (1) what is the genetic basis of inter-species differences in nucleosome positioning? and (2) what is the regulatory function of these differences?
Generally speaking, we can classify the genetic basis of the divergence in nucleosome positioning into two mechanisms. First, mutations in the local DNA sequence may influence the ability to bind nucleosomes at this region; we refer to these as cis effects. Second, mutations may affect the activity of various proteins that alter nucleosome positioning either actively (e.g. chromatin-remodeling enzymes) or by simply competing with nucleosomes for binding to the same DNA sequence (e.g. transcription factors); we refer to these as trans effects.
To classify the observed inter-species differences into cis versus trans effects, we measured allele-specific nucleosome positions within the inter-specific hybrid of the two species (Wittkopp et al, 2004; Tirosh et al, 2009). The hybrid contains the alleles of both species; hence, cis effects, which involve mutations that discriminate between the two alleles, will be maintained in the hybrid so that nucleosome positioning will be different between the alleles coming from the different species. Trans effects, in contrast, will not discriminate between the two hybrid alleles from the different species, as these two alleles reside together at the same trans environment (hybrid nucleus) and are thus regulated by the same set of proteins—the combination of proteins from the two species. Using this approach, we found that ∼70% of the inter-species differences in nucleosome positioning are due to cis effects, whereas the rest is due to trans effects.
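The classification logic in the preceding paragraph can be sketched as below. This is a simplified illustration under stated assumptions, not the authors' procedure: each nucleosome is summarized by a signed positional shift in base pairs between the two parental species and between the two alleles in the hybrid, and a single hypothetical threshold `min_shift` decides whether a shift counts as a difference.

```python
def classify_divergence(parental_shift, hybrid_allelic_shift, min_shift=10):
    """Label one nucleosome as 'conserved', 'cis', or 'trans'.

    parental_shift: position in species A minus position in species B.
    hybrid_allelic_shift: the same difference between the two alleles
    measured inside the hybrid. min_shift (bp) is a hypothetical cutoff.
    """
    if abs(parental_shift) < min_shift:
        return "conserved"
    same_direction = parental_shift * hybrid_allelic_shift > 0
    if abs(hybrid_allelic_shift) >= min_shift and same_direction:
        # The difference persists between alleles sharing one nucleus.
        return "cis"
    # The difference vanishes in the shared trans environment.
    return "trans"


def cis_fraction(shift_pairs, min_shift=10):
    """Fraction of diverged nucleosomes classified as cis, or None if none diverged."""
    calls = [classify_divergence(p, h, min_shift) for p, h in shift_pairs]
    diverged = [call for call in calls if call != "conserved"]
    if not diverged:
        return None
    return diverged.count("cis") / len(diverged)
```

A real analysis would also need to handle mixed cases, where part of a difference is maintained in the hybrid and part is lost, and measurement noise in both comparisons.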
The local DNA sequence is indeed known to affect nucleosome positions, and many features of DNA sequences were proposed to influence nucleosome binding, either by rejecting nucleosomes, or by being favorable for nucleosome binding (Segal et al, 2006; Lee et al, 2007; Kaplan et al, 2009). We find, however, that nucleosome positions diverged primarily through changes in AT-rich sequences, which exclude nucleosomes, whereas mutations in sequences that correlate with high-nucleosome occupancy do not influence inter-species divergence.
Nucleosomes restrict the access of proteins to the DNA and may thus affect DNA-related processes such as transcription, recombination or replication. Indeed, promoters and regulatory sequences are often depleted of nucleosomes, and highly transcribed genes are associated with low occupancy of nucleosomes at their promoters (Lee et al, 2007). Several earlier studies also suggested that evolutionary divergence of gene expression is driven by changes in chromatin structure (Lee et al, 2006; Choi and Kim, 2008; Tirosh et al, 2008; Field et al, 2009). However, we find that nucleosome positions (or occupancy) at regulatory elements are largely conserved, and furthermore, that the inter-species differences in nucleosome positions do not correlate with gene expression differences. These results suggest that nucleosome positioning is not a central mechanism for evolutionary changes in gene regulation and that most of the observed changes may be due to neutral drift.
Does the apparent low influence of nucleosome positioning on gene expression divergence imply that nucleosome positions do not have a function in gene regulation? To address this, we examined two additional modes of gene regulation: transcriptional response to changes in growth conditions (glucose versus glycerol media), and the expression differences between different cell types (haploid versus diploid cells). Consistent with earlier studies, we found that the response to growth conditions is significantly, albeit weakly, associated with changes in nucleosome positioning. Interestingly, we also found a strikingly strong association between gene expression and nucleosomal changes in the two cell types. Taken together, these results suggest that nucleosome positioning is used preferentially for biological processes in which genes are turned on and off (e.g. different cell type), but less so during divergence of closely related species in which gradual changes accumulate over time.
Gene regulation differs greatly between related species, constituting a major source of phenotypic diversity. Recent studies characterized extensive differences in the gene expression programs of closely related species. In contrast, virtually nothing is known about the evolution of chromatin structure and how it influences the divergence of gene expression. Here, we compare the genome-wide nucleosome positioning of two closely related yeast species and, by profiling their inter-specific hybrid, trace the genetic basis of the observed differences into mutations affecting the local DNA sequences (cis effects) or the upstream regulators (trans effects). The majority (∼70%) of inter-species differences is due to cis effects, leaving a significant contribution (30%) for trans factors. We show that cis effects are well explained by mutations in nucleosome-disfavoring AT-rich sequences, but are not associated with divergence of nucleosome-favoring sequences. Differences in nucleosome positioning propagate to multiple adjacent nucleosomes, supporting the statistical positioning hypothesis, and we provide evidence that nucleosome-free regions, but not the +1 nucleosome, serve as stable border elements. Surprisingly, although we find that differential nucleosome positioning among cell types is strongly correlated with differential expression, this does not seem to be the case for evolutionary changes: divergence of nucleosome positioning is excluded from regulatory elements and is not correlated with gene expression divergence, suggesting a primarily neutral mode of evolution. Our results provide evolutionary insights into the genetic determinants and regulatory function of nucleosome positioning.
PMCID: PMC2890324  PMID: 20461072
evolution; gene regulation; nucleosome positioning
25.  A 48 SNP set for grapevine cultivar identification 
BMC Plant Biology  2011;11:153.
Rapid and consistent genotyping is an important requirement for cultivar identification in many crop species. Among them, grapevine cultivars have been the subject of multiple studies, given the large number of synonyms and homonyms generated during many centuries of vegetative multiplication and exchange. Simple sequence repeat (SSR) markers have been preferred until now because of their high level of polymorphism, their codominant nature and their high profile repeatability. However, the rapid application of partial or complete genome sequencing approaches is identifying thousands of single nucleotide polymorphisms (SNPs) that can be very useful for such purposes. Although SNP markers are bi-allelic, and therefore not as polymorphic as microsatellites, the high number of loci that can be multiplexed, the possibilities of automation, and their highly repeatable results under any analytical procedure make them the future markers of choice for any type of genetic identification.
We analyzed over 300 SNPs in the grapevine genome using a re-sequencing strategy in a selection of 11 genotypes. Among the identified polymorphisms, we selected 48 SNPs spread across all grapevine chromosomes, with allele frequencies balanced enough to provide sufficient information content for genetic identification in grapevine while allowing a good genotyping success rate. Marker stability was tested in repeated analyses of a selected group of cultivars obtained worldwide to demonstrate their usefulness in genetic identification.
We have selected a set of 48 stable SNP markers with a high discrimination power and a uniform genome distribution (2-3 markers/chromosome), which is proposed as a standard set for grapevine (Vitis vinifera L.) genotyping. Any previous problems derived from microsatellite allele confusion between labs, or from the need to run reference cultivars to identify allele sizes, disappear with this type of marker. Furthermore, because SNP markers are bi-allelic, allele identification and genotype naming are extremely simple, and genotypes obtained with different equipment and by different laboratories are always fully comparable.
PMCID: PMC3221639  PMID: 22060012
