Sequence-based variation in gene expression is a key driver of disease risk. Common variants regulating expression in cis have been mapped in many eQTL studies typically in single tissues from unrelated individuals. Here, we present a comprehensive analysis of gene expression across multiple tissues conducted in a large set of mono- and dizygotic twins that allows systematic dissection of genetic (cis and trans) and non-genetic effects on gene expression. Using identity-by-descent estimates, we show that at least 40% of the total heritable cis-effect on expression cannot be accounted for by common cis-variants, a finding which exposes the contribution of low frequency and rare regulatory variants with respect to both transcriptional regulation and complex trait susceptibility. We show that a substantial proportion of gene expression heritability is trans to the structural gene and identify several replicating trans-variants which act predominantly in a tissue-restricted manner and may regulate the transcription of many genes.
The process of generating raw genome sequence data continues to become cheaper, faster, and more accurate. However, assembly of such data into high-quality, finished genome sequences remains challenging. Many genome assembly tools are available, but they differ greatly in terms of their performance (speed, scalability, hardware requirements, acceptance of newer read technologies) and in their final output (composition of assembled sequence). More importantly, it remains largely unclear how to best assess the quality of assembled genome sequences. The Assemblathon competitions are intended to assess current state-of-the-art methods in genome assembly.
In Assemblathon 2, we provided a variety of sequence data to be assembled for three vertebrate species (a bird, a fish, and snake). This resulted in a total of 43 submitted assemblies from 21 participating teams. We evaluated these assemblies using a combination of optical map data, Fosmid sequences, and several statistical methods. From over 100 different metrics, we chose ten key measures by which to assess the overall quality of the assemblies.
Many current genome assemblers produced useful assemblies, containing a significant representation of their genes and overall genome structure. However, the high degree of variability between the entries suggests that there is still much room for improvement in the field of genome assembly and that approaches which work well in assembling the genome of one species may not necessarily work well for another.
Genome assembly; N50; Scaffolds; Assessment; Heterozygosity; COMPASS
All non-human great apes are endangered in the wild, and it is therefore important to gain an understanding of their demography and genetic diversity. Whole genome assembly projects have provided an invaluable foundation for understanding genetics in all four genera, but to date genetic studies of multiple individuals within great ape species have largely been confined to mitochondrial DNA and a small number of other loci. Here, we present a genome-wide survey of genetic variation in gorillas using a reduced representation sequencing approach, focusing on the two lowland subspecies. We identify 3,006,670 polymorphic sites in 14 individuals: 12 western lowland gorillas (Gorilla gorilla gorilla) and 2 eastern lowland gorillas (Gorilla beringei graueri). We find that the two species are genetically distinct, based on levels of heterozygosity and patterns of allele sharing. Focusing on the western lowland population, we observe evidence for population substructure, and a deficit of rare genetic variants suggesting a recent episode of population contraction. In western lowland gorillas, there is an elevation of variation towards telomeres and centromeres on the chromosomal scale. On a finer scale, we find substantial variation in genetic diversity, including a marked reduction close to the major histocompatibility locus, perhaps indicative of recent strong selection there. These findings suggest that despite their maintaining an overall level of genetic diversity equal to or greater than that of humans, population decline, perhaps associated with disease, has been a significant factor in recent and long-term pressures on wild gorilla populations.
We present PEER (probabilistic estimation of expression residuals), a software package implementing statistical models that improve the sensitivity and interpretability of genetic associations in population-scale expression data. This approach builds on factor analysis methods that infer broad variance components in the measurements. PEER takes as input transcript profiles and covariates from a set of individuals, and then outputs hidden factors that explain much of the expression variability. Optionally, these factors can be interpreted as pathway or transcription factor activations by providing prior information about which genes are involved in the pathway or targeted by the factor. The inferred factors are used in genetic association analyses. First, they are treated as additional covariates, and are included in the model to increase detection power for mapping expression traits. Second, they are analyzed as phenotypes themselves to understand the causes of global expression variability. PEER extends previous related surrogate variable models and can be implemented within hours on a desktop computer.
Accurate knowledge of haplotypes, the combination of alleles co-residing on a single copy of a chromosome, enables powerful gene mapping and sequence imputation methods. Since humans are diploid, haplotypes must be derived from genotypes by a phasing process. In this study, we present a new computational model for haplotype phasing based on pairwise sharing of haplotypes inferred to be Identical-By-Descent (IBD). We apply the Bayesian network based model in a new phasing algorithm, called systematic long-range phasing (SLRP), that can capitalize on the close genetic relationships in isolated founder populations, and show with simulated and real genome-wide genotype data that SLRP substantially reduces the rate of phasing errors compared to previous phasing algorithms. Furthermore, the method accurately identifies regions of IBD, enabling linkage-like studies without pedigrees, and can be used to impute most genotypes with very low error rate.
haplotype; population isolate; long-range phasing; Bayesian network
Small RNAs are functional molecules that modulate mRNA transcripts and have been implicated in the aetiology of several common diseases. However, little is known about the extent of their variability within the human population. Here, we characterise the extent, causes, and effects of naturally occurring variation in expression and sequence of small RNAs from adipose tissue in relation to genotype, gene expression, and metabolic traits in the MuTHER reference cohort. We profiled the expression of 15 to 30 base pair RNA molecules in subcutaneous adipose tissue from 131 individuals using high-throughput sequencing, and quantified levels of 591 microRNAs and small nucleolar RNAs. We identified three genetic variants and three RNA editing events. Highly expressed small RNAs are more conserved within mammals than average, as are those with highly variable expression. We identified 14 genetic loci significantly associated with nearby small RNA expression levels, seven of which also regulate an mRNA transcript level in the same region. In addition, these loci are enriched for variants significant in genome-wide association studies for body mass index. Contrary to expectation, we found no evidence for negative correlation between expression level of a microRNA and its target mRNAs. Trunk fat mass, body mass index, and fasting insulin were associated with more than twenty small RNA expression levels each, while fasting glucose had no significant associations. This study highlights the similar genetic complexity and shared genetic control of small RNA and mRNA transcripts, and gives a quantitative picture of small RNA expression variation in the human population.
Genetic information is transmitted to the cell only through RNA molecules. A special class of RNAs is comprised of the small (up to 30 nucleotide) ones, known to be potent regulators of various cellular processes. At the same time, they have not been as widely studied as messenger RNAs—we do not know how much variation in their sequence and expression level occurs naturally in human populations or how this variability influences other traits. We measured small RNA levels and genetic variability in fat tissue from 131 individuals by high-throughput sequencing. We could associate the expression levels with genetic background of the individuals, as well as changes in metabolic traits. Surprisingly, we found no large scale influence of small RNA variation on mRNA levels, their main regulatory target. Overall, our study is the first to give a quantitative picture of the naturally occurring variation in these important regulatory molecules in human fat tissue.
Adenosine-to-inosine (A-to-I) editing is a site-selective post-transcriptional alteration of double-stranded RNA by ADAR deaminases that is crucial for homeostasis and development. Recently the Mouse Genomes Project generated genome sequences for 17 laboratory mouse strains and rich catalogues of variants. We also generated RNA-seq data from whole brain RNA from 15 of the sequenced strains.
Here we present a computational approach that takes an initial set of transcriptome/genome mismatch sites and filters these calls taking into account systematic biases in alignment, single nucleotide variant calling, and sequencing depth to identify RNA editing sites with high accuracy. We applied this approach to our panel of mouse strain transcriptomes identifying 7,389 editing sites with an estimated false-discovery rate of between 2.9 and 10.5%. The overwhelming majority of these edits were of the A-to-I type, with less than 2.4% not of this class, and only three of these edits could not be explained as alignment artifacts. We validated 24 novel RNA editing sites in coding sequence, including two non-synonymous edits in the Cacna1d gene that fell into the IQ domain portion of the Cav1.2 voltage-gated calcium channel, indicating a potential role for editing in the generation of transcript diversity.
We show that despite over two million years of evolutionary divergence, the sites edited and the level of editing at each site is remarkably consistent across the 15 strains. In the Cds2 gene we find evidence for RNA editing acting to preserve the ancestral transcript sequence despite genomic sequence divergence.
The genetic basis of gene expression variation has long been studied with the aim to understand the landscape of regulatory variants, but also more recently to assist in the interpretation and elucidation of disease signals. To date, many studies have looked in specific tissues and population-based samples, but there has been limited assessment of the degree of inter-population variability in regulatory variation. We analyzed genome-wide gene expression in lymphoblastoid cell lines from a total of 726 individuals from 8 global populations from the HapMap3 project and correlated gene expression levels with HapMap3 SNPs located in cis to the genes. We describe the influence of ancestry on gene expression levels within and between these diverse human populations and uncover a non-negligible impact on global patterns of gene expression. We further dissect the specific functional pathways differentiated between populations. We also identify 5,691 expression quantitative trait loci (eQTLs) after controlling for both non-genetic factors and population admixture and observe that half of the cis-eQTLs are replicated in one or more of the populations. We highlight patterns of eQTL-sharing between populations, which are partially determined by population genetic relatedness, and discover significant sharing of eQTL effects between Asians, European-admixed, and African subpopulations. Specifically, we observe that both the effect size and the direction of effect for eQTLs are highly conserved across populations. We observe an increasing proximity of eQTLs toward the transcription start site as sharing of eQTLs among populations increases, highlighting that variants close to TSS have stronger effects and therefore are more likely to be detected across a wider panel of populations. Together these results offer a unique picture and resource of the degree of differentiation among human populations in functional regulatory variation and provide an estimate for the transferability of complex trait variants across populations.
Variation among individuals in the degree to which genes are expressed (i.e. turned on or off) is a characteristic exhibited by all species, and studies have identified regions of the genome harboring genetic variation affecting gene expression levels. To assess the degree of human inter-population variability in regulatory variation, we describe mapping of regions of the genome that have functional effects on gene expression levels. We analyzed genome-wide gene expression in human cell lines derived from 726 unrelated individuals representing 8 global populations that have been genetically well-characterized by the International HapMap Project. We describe the influence of ancestry on gene expression levels within and between these diverse human populations and uncover a non-negligible impact on global patterns of gene expression. We identify ∼5,700 genes whose expression levels are associated with genetic variation located physically close to the gene, and we observe significant sharing of associations that is partially dependent on population genetic relatedness, among Asians, European-admixed, and African subpopulations. We identify biological functions affected by regulatory variation and describe common and unique characteristics of population-specific and population-shared associations. These results offer a unique picture and resource of the degree of differentiation among human populations in functional regulatory variation.
We report genome sequences of 17 inbred strains of laboratory mice and identify almost ten times more variants than previously known. We use these genomes to explore the phylogenetic history of the laboratory mouse and to examine the functional consequences of allele-specific variation on transcript abundance, revealing that at least 12% of transcripts show a significant tissue-specific expression bias. By identifying candidate functional variants at 718 quantitative trait loci we show that the molecular nature of functional variants and their position relative to genes vary according to the effect size of the locus. These sequences provide a starting point for a new era in the functional analysis of a key model organism.
The history of human population size is important to understanding human evolution. Various studies1-5 have found evidence for a founder event (bottleneck) in East Asian and European populations associated with the human dispersal out-of-Africa event around 60 thousand years ago (kya) before present. However, these studies have to assume simplified demographic models with few parameters and do not precisely date the start and stop times of the bottleneck. Here, with fewer assumptions on population size changes, we present a more detailed history of human population sizes between approximately ten thousand to a million years ago, using the pairwise sequentially Markovian coalescent (PSMC) model applied to the complete diploid genome sequences of a Chinese male (YH)6, a Korean male (SJK)7, three European individuals (Venter8, NA12891 and NA128789) and two Yoruba males (NA1850710 and NA19239). We infer that European and Chinese populations had very similar population size histories before 10–20kya. Both populations experienced a severe bottleneck between 10–60kya while African populations experienced a milder bottleneck from which they recovered earlier. All three populations have an elevated effective population size between 60–250kya, possibly due to a population structure11. We also infer that the differentiation of genetically modern humans may have started as early as 100–120kya12, but considerable genetic exchanges may still have occurred until 20–40kya.
WormBase (www.wormbase.org) has been serving the scientific community for over 11 years as the central repository for genomic and genetic information for the soil nematode Caenorhabditis elegans. The resource has evolved from its beginnings as a database housing the genomic sequence and genetic and physical maps of a single species, and now represents the breadth and diversity of nematode research, currently serving genome sequence and annotation for around 20 nematodes. In this article, we focus on WormBase’s role of genome sequence annotation, describing how we annotate and integrate data from a growing collection of nematode species and strains. We also review our approaches to sequence curation, and discuss the impact on annotation quality of large functional genomics projects such as modENCODE.
Caenorhabditis elegans; annotation; community resource; genome; model organism database; nematode; parasitic nematode; sequence curation
The Ensembl project (http://www.ensembl.org) provides genome resources for chordate genomes with a particular focus on human genome data as well as data for key model organisms such as mouse, rat and zebrafish. Five additional species were added in the last year including gibbon (Nomascus leucogenys) and Tasmanian devil (Sarcophilus harrisii) bringing the total number of supported species to 61 as of Ensembl release 64 (September 2011). Of these, 55 species appear on the main Ensembl website and six species are provided on the Ensembl preview site (Pre!Ensembl; http://pre.ensembl.org) with preliminary support. The past year has also seen improvements across the project.
Since its release in 2000, WormBase (http://www.wormbase.org) has grown from a small resource focusing on a single species and serving a dedicated research community, to one now spanning 15 species essential to the broader biomedical and agricultural research fields. To enhance the rate of curation, we have automated the identification of key data in the scientific literature and use similar methodology for data extraction. To ease access to the data, we are collaborating with journals to link entities in research publications to their report pages at WormBase. To facilitate discovery, we have added new views of the data, integrated large-scale datasets and expanded descriptions of models for human disease. Finally, we have introduced a dramatic overhaul of the WormBase website for public beta testing. Designed to balance complexity and usability, the new site is species-agnostic, highly customizable, and interactive. Casual users and developers alike will be able to leverage the public RESTful application programming interface (API) to generate custom data mining solutions and extensions to the site. We report on the growth of our database and on our work in keeping pace with the growing demand for data, efforts to anticipate the requirements of users and new collaborations with the larger science community.
A fundamental goal in biology is to achieve a mechanistic understanding of how and to what extent ecological variation imposes selection for distinct traits and favors the fixation of specific genetic variants. Key to such an understanding is the detailed mapping of the natural genomic and phenomic space and a bridging of the gap that separates these worlds. Here we chart a high-resolution map of natural trait variation in one of the most important genetic model organisms, the budding yeast Saccharomyces cerevisiae, and its closest wild relatives and trace the genetic basis and timing of major phenotype changing events in its recent history. We show that natural trait variation in S. cerevisiae exceeds that of its relatives, despite limited genetic variation, and follows the population history rather than the source environment. In particular, the West African population is phenotypically unique, with an extreme abundance of low-performance alleles, notably a premature translational termination signal in GAL3 that cause inability to utilize galactose. Our observations suggest that many S. cerevisiae traits may be the consequence of genetic drift rather than selection, in line with the assumption that natural yeast lineages are remnants of recent population bottlenecks. Disconcertingly, the universal type strain S288C was found to be highly atypical, highlighting the danger of extrapolating gene-trait connections obtained in mosaic, lab-domesticated lineages to the species as a whole. Overall, this study represents a step towards an in-depth understanding of the causal relationship between co-variation in ecology, selection pressure, natural traits, molecular mechanism, and alleles in a key model organism.
An overall aim in modern biology is to achieve an in-depth understanding of an organism's physiology in the context of its ecology and historic selective pressures that have been acting on its genome. The baker's yeast, Saccharomyces cerevisiae, has a peculiar life history completely dominated by clonal reproduction and self-fertilization, prompting the suggestion that natural yeasts are remnants of repeated population bottlenecks in essentially clonal lineages. Such a life history dominated by mitotic proliferation purports a strong evolutionary influence of genetic drift and predicts trait variation to be high and largely defined by the genetic history of each population. Here we chart a highly resolved map of natural trait variation in S. cerevisiae and its closest non-domesticated relative, Saccharomyces paradoxus, and confirm this prediction. We found that trait variation in budding yeast is indeed high and largely defined by population rather than source environment. In particular, the West African population was found to be phenotypically unique with an extreme abundance of low-performance alleles. Our findings support the idea of population bottlenecks in the recent yeast evolutionary history and a large influence of genetic drift.
Summary: The variant call format (VCF) is a generic format for storing DNA polymorphism data such as SNPs, insertions, deletions and structural variants, together with rich annotations. VCF is usually stored in a compressed manner and can be indexed for fast data retrieval of variants from a range of positions on the reference genome. The format was developed for the 1000 Genomes Project, and has also been adopted by other projects such as UK10K, dbSNP and the NHLBI Exome Project. VCFtools is a software suite that implements various utilities for processing VCF files, including validation, merging, comparing and also provides a general Perl API.
Despite our rapidly growing knowledge about the human genome, we do not know all of the genes required for some of the most basic functions of life. To start to fill this gap we developed a high-throughput phenotypic screening platform combining potent gene silencing by RNA interference, time-lapse microscopy and computational image processing. We carried out a genome-wide phenotypic profiling of each of the ~21,000 human protein-coding genes by two-day live imaging of fluorescently labelled chromosomes. Phenotypes were scored quantitatively by computational image processing, which allowed us to identify hundreds of human genes involved in diverse biological functions including cell division, migration and survival. As part of the Mitocheck consortium, this study provides an in-depth analysis of cell division phenotypes and makes the entire high-content data set available as a resource to the community.
While there have been studies exploring regulatory variation in one or more tissues, the complexity of tissue-specificity in multiple primary tissues is not yet well understood. We explore in depth the role of cis-regulatory variation in three human tissues: lymphoblastoid cell lines (LCL), skin, and fat. The samples (156 LCL, 160 skin, 166 fat) were derived simultaneously from a subset of well-phenotyped healthy female twins of the MuTHER resource. We discover an abundance of cis-eQTLs in each tissue similar to previous estimates (858 or 4.7% of genes). In addition, we apply factor analysis (FA) to remove effects of latent variables, thus more than doubling the number of our discoveries (1,822 eQTL genes). The unique study design (Matched Co-Twin Analysis—MCTA) permits immediate replication of eQTLs using co-twins (93%–98%) and validation of the considerable gain in eQTL discovery after FA correction. We highlight the challenges of comparing eQTLs between tissues. After verifying previous significance threshold-based estimates of tissue-specificity, we show their limitations given their dependency on statistical power. We propose that continuous estimates of the proportion of tissue-shared signals and direct comparison of the magnitude of effect on the fold change in expression are essential properties that jointly provide a biologically realistic view of tissue-specificity. Under this framework we demonstrate that 30% of eQTLs are shared among the three tissues studied, while another 29% appear exclusively tissue-specific. However, even among the shared eQTLs, a substantial proportion (10%–20%) have significant differences in the magnitude of fold change between genotypic classes across tissues. Our results underline the need to account for the complexity of eQTL tissue-specificity in an effort to assess consequences of such variants for complex traits.
Regulation of gene expression is a fundamental cellular process determining a large proportion of the phenotypic variance. Previous studies have identified genetic loci influencing gene expression levels (eQTLs), but the complexity of their tissue-specific properties has not yet been well-characterized. In this study, we perform cis-eQTL analysis in a unique matched co-twin design for three human tissues derived simultaneously from the same set of individuals. The study design allows validation of the substantial discoveries we make in each tissue. We explore in depth the tissue-dependent features of regulatory variants and estimate the proportions of shared and specific effects. We use continuous measures of eQTL sharing to circumvent the statistical power limitations of comparing direct overlap of eQTLs in multiple tissues. In this framework, we demonstrate that 30% of eQTLs are shared among tissues, while 29% are exclusively tissue-specific. Furthermore, we show that the fold change in expression between eQTL genotypic classes differs between tissues. Even among shared eQTLs, we report a substantial proportion (10%–20%) of significant tissue differences in magnitude of these effects. The complexities we highlight here are essential for understanding the impact of regulatory variants on complex traits.
Even within a defined cell type, the expression level of a gene differs in individual samples. The effects of genotype, measured factors such as environmental conditions, and their interactions have been explored in recent studies. Methods have also been developed to identify unmeasured intermediate factors that coherently influence transcript levels of multiple genes. Here, we show how to bring these two approaches together and analyse genetic effects in the context of inferred determinants of gene expression. We use a sparse factor analysis model to infer hidden factors, which we treat as intermediate cellular phenotypes that in turn affect gene expression in a yeast dataset. We find that the inferred phenotypes are associated with locus genotypes and environmental conditions and can explain genetic associations to genes in trans. For the first time, we consider and find interactions between genotype and intermediate phenotypes inferred from gene expression levels, complementing and extending established results.
The first step in transmitting heritable information, expressing RNA molecules, is highly regulated and depends on activations of specific pathways and regulatory factors. The state of the cell is hard to measure, making it difficult to understand what drives the changes in the gene expression. To close this gap, we apply a statistical model to infer the state of the cell, such as activations of transcription factors and molecular pathways, from gene expression data. We demonstrate how the inferred state helps to explain the effects of variation in the DNA and environment on the expression trait via both direct regulatory effects and interactions with the genetic state. Such analysis, exploiting inferred intermediate phenotypes, will aid understanding effects of genetic variability on global traits and will help to interpret the data from existing and forthcoming large scale studies.
Chromosome segregation and cell division are essential, highly ordered processes that depend on numerous protein complexes. Results from recent RNA interference (RNAi) screens indicate that the identity and composition of these protein complexes is incompletely understood. Using gene tagging on bacterial artificial chromosomes, protein localization and tandem affinity purification-mass spectrometry, the MitoCheck consortium has analyzed about 100 human protein complexes, many of which had not or only incompletely been characterized. This work has led to the discovery of previously unknown, evolutionarily conserved subunits of the anaphase-promoting complex (APC/C) and the γ-tubulin ring complex (γ-TuRC), large complexes which are essential for spindle assembly and chromosome segregation. The approaches we describe here are generally applicable to high throughput follow-up analyses of phenotypic screens in mammalian cells.
The Ensembl project (http://www.ensembl.org) seeks to enable genomic science by providing high quality, integrated annotation on chordate and selected eukaryotic genomes within a consistent and accessible infrastructure. All supported species include comprehensive, evidence-based gene annotations and a selected set of genomes includes additional data focused on variation, comparative, evolutionary, functional and regulatory annotation. The most advanced resources are provided for key species including human, mouse, rat and zebrafish reflecting the popularity and importance of these species in biomedical research. As of Ensembl release 59 (August 2010), 56 species are supported of which 5 have been added in the past year. Since our previous report, we have substantially improved the presentation and integration of both data of disease relevance and the regulatory state of different cell types.
Correct gene predictions are crucial for most analyses of genomes. However, in the absence of transcript data, gene prediction is still challenging. One way to improve gene-finding accuracy in such genomes is to combine the exons predicted by several gene-finders, so that gene-finders that make uncorrelated errors can correct each other.
We present a method for combining gene-finders called Genomix. Genomix selects the predicted exons that are best conserved within and/or between species in terms of sequence and intron–exon structure, and combines them into a gene structure. Genomix was used to combine predictions from four gene-finders for Caenorhabditis elegans, by selecting the predicted exons that are best conserved with C.briggsae and C.remanei. On a set of ~1500 confirmed C.elegans genes, Genomix increased the exon-level specificity by 10.1% and sensitivity by 2.7% compared to the best input gene-finder.
Motivation: Sequence assembly is a difficult problem whose importance has grown again recently as the cost of sequencing has dramatically dropped. Most new sequence assembly software has started by building a de Bruijn graph, avoiding the overlap-based methods used previously because of the computational cost and complexity of these with very large numbers of short reads. Here, we show how to use suffix array-based methods that have formed the basis of recent very fast sequence mapping algorithms to find overlaps and generate assembly string graphs asymptotically faster than previously described algorithms.
Results: Standard overlap assembly methods have time complexity O(N2), where N is the sum of the lengths of the reads. We use the Ferragina–Manzini index (FM-index) derived from the Burrows–Wheeler transform to find overlaps of length at least τ among a set of reads. As well as an approach that finds all overlaps then implements transitive reduction to produce a string graph, we show how to output directly only the irreducible overlaps, significantly shrinking memory requirements and reducing compute time to O(N), independent of depth. Overlap-based assembly methods naturally handle mixed length read sets, including capillary reads or long reads promised by the third generation sequencing technologies. The algorithms we present here pave the way for overlap-based assembly approaches to be developed that scale to whole vertebrate genome de novo assembly.
The interpretation of genome sequences requires reliable and standardized methods to assess protein function at high throughput. Here we describe a fast and reliable pipeline to study protein function in mammalian cells based on protein tagging in bacterial artificial chromosomes (BACs). The large size of the BAC transgenes ensures the presence of most, if not all, regulatory elements and results in expression that closely matches that of the endogenous gene. We show that BAC transgenes can be rapidly and reliably generated using 96-well-format recombineering. After stable transfection of these transgenes into human tissue culture cells or mouse embryonic stem cells, the localization, protein-protein and/or protein-DNA interactions of the tagged protein are studied using generic, tag-based assays. The same high-throughput approach will be generally applicable to other model systems.
Gene expression measurements are influenced by a wide range of factors, such as the state of the cell, experimental conditions and variants in the sequence of regulatory regions. To understand the effect of a variable of interest, such as the genotype of a locus, it is important to account for variation that is due to confounding causes. Here, we present VBQTL, a probabilistic approach for mapping expression quantitative trait loci (eQTLs) that jointly models contributions from genotype as well as known and hidden confounding factors. VBQTL is implemented within an efficient and flexible inference framework, making it fast and tractable on large-scale problems. We compare the performance of VBQTL with alternative methods for dealing with confounding variability on eQTL mapping datasets from simulations, yeast, mouse, and human. Employing Bayesian complexity control and joint modelling is shown to result in more precise estimates of the contribution of different confounding factors resulting in additional associations to measured transcript levels compared to alternative approaches. We present a threefold larger collection of cis eQTLs than previously found in a whole-genome eQTL scan of an outbred human population. Altogether, 27% of the tested probes show a significant genetic association in cis, and we validate that the additional eQTLs are likely to be real by replicating them in different sets of individuals. Our method is the next step in the analysis of high-dimensional phenotype data, and its application has revealed insights into genetic regulation of gene expression by demonstrating more abundant cis-acting eQTLs in human than previously shown. Our software is freely available online at http://www.sanger.ac.uk/resources/software/peer/.
Gene expression is a complex phenotype. The measured expression level in an experiment can be affected by a wide range of factors—state of the cell, experimental conditions, variants in the sequence of regulatory regions, and others. To understand genotype-to-phenotype relationships, we need to be able to distinguish the variation that is due to the genetic state from all the confounding causes. We present VBQTL, a probabilistic method for dissecting gene expression variation by jointly modelling the underlying global causes of variability and the genetic effect. Our method is implemented in a flexible framework that allows for quick model adaptation and comparison with alternative models. The probabilistic approach yields more accurate estimates of the contributions from different sources of variation. Applying VBQTL, we find that common genetic variation controlling gene expression levels in human is more abundant than previously shown, which has implications for a wide range of studies relating genotype to phenotype.
Motivation: Many programs for aligning short sequencing reads to a reference genome have been developed in the last 2 years. Most of them are very efficient for short reads but inefficient or not applicable for reads >200 bp because the algorithms are heavily and specifically tuned for short queries with low sequencing error rate. However, some sequencing platforms already produce longer reads and others are expected to become available soon. For longer reads, hashing-based software such as BLAT and SSAHA2 remain the only choices. Nonetheless, these methods are substantially slower than short-read aligners in terms of aligned bases per unit time.
Results: We designed and implemented a new algorithm, Burrows-Wheeler Aligner's Smith-Waterman Alignment (BWA-SW), to align long sequences up to 1 Mb against a large sequence database (e.g. the human genome) with a few gigabytes of memory. The algorithm is as accurate as SSAHA2, more accurate than BLAT, and is several to tens of times faster than both.