To identify loci for age at menarche, we performed a meta-analysis of 32 genome-wide association studies in 87,802 women of European descent, with replication in up to 14,731 women. In addition to the known loci at LIN28B (P=5.4×10−60) and 9q31.2 (P=2.2×10−33), we identified 30 novel menarche loci (all P<5×10−8) and found suggestive evidence for a further 10 loci (P<1.9×10−6). New loci included four previously associated with BMI (in/near FTO, SEC16B, TRA2B and TMEM18), three in/near other genes implicated in energy homeostasis (BSX, CRTC1, and MCHR2), and three in/near genes implicated in hormonal regulation (INHBA, PCSK2 and RXRG). Ingenuity and MAGENTA pathway analyses identified coenzyme A and fatty acid biosynthesis as biological processes related to menarche timing.
The epigenome orchestrates genome accessibility, functionality and three-dimensional structure. Because epigenetic variation can impact transcription and thus phenotypes, it may contribute to adaptation. Here we report 1,107 high-quality single-base resolution methylomes and 1,203 transcriptomes from the 1001 Genomes collection of Arabidopsis thaliana. Although the genetic basis of methylation variation is highly complex, geographic origin is a major predictor of genome-wide DNA methylation levels and of altered gene expression caused by epialleles. Comparison to cistrome and epicistrome datasets identifies associations between transcription factor binding sites, methylation, nucleotide variation and co-expression modules. Physical maps for nine of the most diverse genomes reveal how transposons and other structural variants shape the epigenome, with dramatic effects on immunity genes. The 1001 Epigenomes Project provides a comprehensive resource for understanding how variation in DNA methylation contributes to molecular and non-molecular phenotypes in natural populations of the most studied model plant.
Clinical response to the atypical antipsychotic paliperidone is known to vary among schizophrenic patients. We carried out a genome-wide association study to identify common genetic variants predictive of paliperidone efficacy.
We leveraged a collection of 1390 samples from individuals of European ancestry enrolled in 12 clinical studies investigating the efficacy of the extended-release tablet paliperidone ER (n1=490) and the once-monthly injection paliperidone palmitate (n2=550 and n3=350). We carried out a genome-wide association study using a general linear model (GLM) analysis on three separate cohorts followed by meta-analysis, and a mixed linear model analysis on all samples. The variations in response explained by each single nucleotide polymorphism (h2SNP) were estimated.
No SNP passed genome-wide significance in the GLM-based analyses, although suggestive signals were observed for rs56240334 [P=7.97×10−8 for change in the Clinical Global Impression Scale-Severity (CGI-S); P=8.72×10−7 for change in the total Positive and Negative Syndrome Scale (PANSS)] in the intron of ADCK1. The mixed linear model-based association P-values for rs56240334 were consistent with the results from GLM-based analyses, and the association with change in CGI-S (P=4.26×10−8) reached genome-wide significance (i.e. P<5×10−8). We also found suggestive evidence for a polygenic contribution toward paliperidone treatment response, with estimates of heritability, h2SNP, ranging from 0.31 to 0.43 for change in the total PANSS score, the PANSS positive Marder factor score, and CGI-S.
Genetic variations in the ADCK1 gene may differentially predict paliperidone efficacy in schizophrenic patients. However, this finding should be replicated in additional samples.
ADCK1; antipsychotics; genome-wide association study; heritability; paliperidone; pharmacogenetics; pharmacogenomics; polygenic; schizophrenia
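The per-cohort GLM analyses followed by meta-analysis can be illustrated with a minimal sketch. The abstract does not state which meta-analysis method was used; the fixed-effect inverse-variance scheme below is an assumption chosen because it is a common default, and the function name `inverse_variance_meta` and all numeric inputs are hypothetical.

```python
import math

def inverse_variance_meta(betas, ses):
    """Fixed-effect inverse-variance meta-analysis for one SNP.

    betas: per-cohort effect estimates; ses: their standard errors.
    Returns the pooled effect, its standard error, the z-score and a
    two-sided p-value from the standard normal distribution.
    """
    weights = [1.0 / (se * se) for se in ses]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z = pooled_beta / pooled_se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    return pooled_beta, pooled_se, z, p

# Hypothetical estimates for three cohorts (not values from the study).
cohort_betas = [0.30, 0.25, 0.35]
cohort_ses = [0.10, 0.09, 0.12]
pooled_beta, pooled_se, z, p = inverse_variance_meta(cohort_betas, cohort_ses)
genome_wide_significant = p < 5e-8  # threshold quoted in the abstract
```

Cohorts with smaller standard errors receive larger weights, so the pooled estimate is dominated by the most precisely estimated effects.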
Neurophysiological measurements of the response to prepulse and startle stimuli have been suggested to represent an important endophenotype for both substance dependence and other select psychiatric disorders. We have previously shown, in young adult Mexican Americans (MA), that presentation of a short-delay acoustic prepulse prior to the startle stimuli can elicit a late negative component at about 400 msec (N4S) in the event-related potential (ERP) recorded from frontal cortical areas. In the present study we investigated whether genetic factors associated with this endophenotype could be identified. The study included 420 MA men (n=170) and women (n=250) aged 18–30 years. DNA was genotyped using an Affymetrix Axiom Exome1A chip. An association analysis revealed that the CCKAR and CCKBR (cholecystokinin A and B receptor) genes each had a nearby variant that showed suggestive significance with the amplitude of the N4S component to prepulse stimuli. The neurotransmitter cholecystokinin (CCK), along with its receptors, CCKAR and CCKBR, has been previously associated with psychiatric disorders, suggesting that variants near these genes may play a role in the prepulse/startle response in this cohort.
Mexican Americans; startle response; EEG; ERP; alcohol dependence
Mitochondrial DNA (mtDNA) heteroplasmy is a mixture of normal and mutated mtDNA molecules in a cell. High levels of heteroplasmy at specific mtDNA sites lead to inherited mitochondrial diseases with neurological, sensory, and movement impairments. Here we test the hypothesis that heteroplasmy levels in elderly adults are associated with impaired function resembling mild forms of mitochondrial disease.
We examined platelet mtDNA heteroplasmy at 20 disease-causing sites for associations with neurosensory and mobility function among 137 participants from the community-based Health, Aging, and Body Composition Study.
Elevated mtDNA heteroplasmy at four mtDNA sites in complex I and tRNA genes was nominally associated with reduced cognition, vision, hearing, and mobility: m.10158T>C with Modified Mini-Mental State Examination score (p = .009); m.11778G>A with contrast sensitivity (p = .02); m.7445A>G with high-frequency hearing (p = .047); and m.5703G>A with 400 m walking speed (p = .007).
These results indicate that increased mtDNA heteroplasmy at disease-causing sites is associated with neurosensory and mobility function in older persons. We propose the novel use of mtDNA heteroplasmy as a simple, noninvasive predictor of age-related neurologic, sensory, and movement impairments.
Mitochondrial DNA; Heteroplasmy; Cognition; Vision; Hearing; Mobility.
Previous studies have shown that individuals with schizophrenia have a greater risk for psoriasis than a typical person. This suggests that there might be a shared genetic etiology between the 2 conditions. We aimed to characterize the potential shared genetic susceptibility between schizophrenia and psoriasis using genome-wide marker genotype data.
We obtained genetic data on individuals with psoriasis, schizophrenia and control individuals. We applied a marker-based coheritability estimation procedure, polygenic score analysis, a gene set enrichment test and a least absolute shrinkage and selection operator regression model to estimate the potential shared genetic etiology between the 2 diseases. We validated the results in independent schizophrenia and psoriasis cohorts from Singapore.
We included 1139 individuals with psoriasis, 744 with schizophrenia and 1678 controls in our analysis, and we validated the results in independent cohorts, including 441 individuals with psoriasis (and 2420 controls) and 1630 with schizophrenia (and 1860 controls). We estimated that a large fraction of schizophrenia and psoriasis risk could be attributed to common variants (h2SNP = 29% ± 5.0%, p = 2.00 × 10−8), with a coheritability estimate between the traits of 21%. We identified 5 variants within the human leukocyte antigen (HLA) gene region, which were most likely to be associated with both diseases and collectively conferred a significant risk effect (odds ratio of highest risk quartile = 6.03, p < 2.00 × 10−16). We discovered that variants contributing most to the shared heritable component between psoriasis and schizophrenia were enriched in antigen processing and cell endoplasmic reticulum.
Our sample size was relatively small. The findings of 5 HLA gene variants were complicated by the complex structure in the HLA region.
We found evidence for a shared genetic etiology between schizophrenia and psoriasis. The mechanism for this shared genetic basis likely involves immune and calcium signalling pathways.
The contribution of collections of rare sequence variations (or ‘variants’) to phenotypic expression has begun to receive considerable attention within the biomedical research community. However, the best way to capture the effects of rare variants in relevant statistical analysis models is an open question. In this paper we describe the application of a number of statistical methods for testing associations between rare variants in two genes and obesity. We consider the relative merits of the different methods as well as important implementation details, such as the leveraging of genomic annotations and determining p-values.
The many subcomponents of the human cortex are known to follow an anatomical pattern and functional relationship that appears to be highly conserved between individuals. This suggests that this pattern and the relationship among cortical regions are important for cortical function and likely shaped by genetic factors, although the degree to which genetic factors contribute to this pattern is unknown. We assessed the genetic relationships among 12 cortical surface areas using brain images and genotype information on 2,364 unrelated individuals, brain images on 466 twin pairs, and transcriptome data on 6 postmortem brains in order to determine whether a consistent and biologically meaningful pattern could be identified from these very different data sets. We find that the patterns revealed by each data set are highly consistent (p<10−3), and are biologically meaningful on several fronts. For example, close genetic relationships are seen in cortical regions within the same lobes, and the frontal lobe, a region showing great evolutionary expansion and functional complexity, has the most distant genetic relationship with other lobes. The frontal lobe also exhibits the most distinct expression pattern relative to the other regions, implicating a number of genes with known functions mediating immune and related processes. Our analyses reflect one of the first attempts to provide an assessment of the biological consistency of a genetic phenomenon involving the brain that leverages very different types of data, and therefore is not just statistical replication, which purposefully uses very similar data sets.
Although functional and anatomical connections among cortical regions have been intensively explored, genetically-mediated relationships between cortical regions have not been pursued to the same degree. Identifying genetic factors that mediate these relationships among different brain subcomponents can provide insight into how the human brain is organized and functions. We have assessed the genetic relationships among cortical regions using an integrated approach that considers twin data, genotype information among a large set of unrelated individuals, and gene expression measurements from postmortem neural tissues. We looked for evidence that subsets of cortical brain regions are under common or unique genetic control. We found that the patterns of genetic relationships are highly consistent across three independent data sets and multiple lines of evidence, suggesting that the patterning of cortical surface area is strongly mediated by genetic factors and, furthermore, likely reflects underlying anatomical and possibly functional relationships among cortical brain regions.
It is now feasible to examine the composition and diversity of microbial communities (i.e., “microbiomes”) that populate different human organs and orifices using DNA sequencing and related technologies. To explore the potential links between changes in microbial communities and various diseases in the human body, it is essential to test associations involving different species within and across microbiomes, environmental settings and disease states. Although a number of statistical techniques exist for carrying out relevant analyses, it is unclear which of these techniques exhibit the greatest statistical power to detect associations given the complexity of most microbiome datasets. We compared the statistical power of principal component regression, partial least squares regression, regularized regression, distance-based regression, Hill's diversity measures, and a modified test implemented in the popular and widely used microbiome analysis methodology “Metastats” across a wide range of simulated scenarios involving changes in feature abundance between two sets of metagenomic samples. For this purpose, simulation studies were used to change the abundance of microbial species in a real dataset from a published study examining human hands. Each technique was applied to the same data, and its ability to detect the simulated change in abundance was assessed. We hypothesized that a small subset of methods would outperform the rest in terms of the statistical power. Indeed, we found that the Metastats technique modified to accommodate multivariate analysis and partial least squares regression yielded high power under the models and data sets we studied. The statistical power of diversity measure-based tests, distance-based regression and regularized regression was significantly lower. 
Our results provide insight into powerful analysis strategies that utilize information on species counts from large microbiome data sets exhibiting skewed frequency distributions obtained on a small to moderate number of samples.
multivariate regression; diversity; abundance; metagenomics; microbiome; statistical power
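The simulation-based power comparison described above can be sketched in a much reduced form. This is not the study's pipeline: it assumes a single feature, log-scale abundances drawn from a normal distribution, and a simple permutation test on the difference in group means, none of which come from the abstract. The names `permutation_p_value` and `estimate_power` and every parameter value are hypothetical.

```python
import math
import random

def permutation_p_value(group_a, group_b, n_permutations, rng):
    """Two-sided permutation p-value for a difference in group means."""
    def mean(values):
        return sum(values) / len(values)

    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group labels at random
        if abs(mean(pooled[:n_a]) - mean(pooled[n_a:])) >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return (hits + 1) / (n_permutations + 1)

def estimate_power(fold_change, n_per_group=10, n_simulations=40,
                   n_permutations=200, alpha=0.05, seed=0):
    """Fraction of simulated data sets in which the spiked change is detected."""
    rng = random.Random(seed)
    shift = math.log(fold_change)  # simulated change on the log-abundance scale
    rejections = 0
    for _ in range(n_simulations):
        controls = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        cases = [rng.gauss(0.0, 1.0) + shift for _ in range(n_per_group)]
        if permutation_p_value(controls, cases, n_permutations, rng) <= alpha:
            rejections += 1
    return rejections / n_simulations

power_large_change = estimate_power(fold_change=50.0)
false_positive_rate = estimate_power(fold_change=1.0)  # no change simulated
```

Running the same loop with a different test statistic in place of `permutation_p_value` gives a like-for-like power comparison, which is the logic of the study design even though the methods compared there are multivariate.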
The Scripps Idiopathic Diseases of huMan (IDIOM) study aims to discover novel gene-disease relationships and provide molecular genetic diagnosis and treatment guidance for individuals with novel diseases using genome sequencing integrated with clinical assessment and multidisciplinary case review.
Here we describe the IDIOM study operational protocol and initial results.
121 cases underwent first tier review by the principal investigators to determine if the primary inclusion criteria were satisfied, 59 (48.8%) underwent second tier review by our clinician-scientist review panel, and 17 (14.0%) patients and their family members were enrolled. 60% of cases resulted in a plausible molecular diagnosis. 18% of cases resulted in a confirmed molecular diagnosis. 2 of 3 confirmed cases led to the identification of novel gene-disease relationships. In the third confirmed case, a previously described but unrecognized disease was revealed. In all three confirmed cases, a new clinical management strategy was initiated based on the genetic findings.
Genome sequencing provides tangible clinical benefit for individuals with idiopathic genetic disease, not only in the context of molecular genetic diagnosis of known rare conditions, but also in cases where prior clinical information regarding a new genetic disorder is lacking.
clinical sequencing; rare disease; genomics; undiagnosed diseases; genome sequencing
Analysis of the biological gene networks involved in a disease may lead to the identification of therapeutic targets. Such analysis requires exploring network properties, in particular the importance of individual network nodes (i.e., genes). There are many measures that consider the importance of nodes in a network and some may shed light on the biological significance and potential optimality of a gene or set of genes as therapeutic targets. This has been shown to be the case in cancer therapy. A dilemma exists, however, in finding the best therapeutic targets based on network analysis since the optimal targets should be nodes that are highly influential in, but not toxic to, the functioning of the entire network. In addition, cancer therapeutics targeting a single gene often result in relapse since compensatory, feedback and redundancy loops in the network may offset the activity associated with the targeted gene. Thus, multiple genes reflecting parallel functional cascades in a network should be targeted simultaneously, but require the identification of such targets. We propose a methodology that exploits centrality statistics characterizing the importance of nodes within a gene network that is constructed from the gene expression patterns in that network. We consider centrality measures based on both graph theory and spectral graph theory. We also consider the origins of a network topology, and show how different available representations yield different node importance results. We apply our techniques to tumor gene expression data and suggest that the identification of optimal therapeutic targets involving particular genes, pathways and sub-networks based on an analysis of the nodes in that network is possible and can facilitate individualized cancer treatments. 
The proposed methods also have the potential to identify candidate cancer therapeutic targets that are not thought to be oncogenes but nonetheless play important roles in the functioning of a cancer-related network or pathway.
network analysis; centrality; cancer; pathway; drug targets; personalized treatment; gene expression
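A minimal sketch of the idea of ranking genes by centrality in an expression-derived network follows. The abstract does not name the specific measures or the network construction used; degree centrality (graph-theoretic), eigenvector centrality (spectral), the absolute-correlation threshold of 0.8, and the toy expression values are all assumptions made for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    denom = math.sqrt(sxx * syy)
    return sxy / denom if denom > 0 else 0.0

def coexpression_adjacency(expression, threshold=0.8):
    """Link two genes when |correlation| of their profiles meets the threshold."""
    genes = sorted(expression)
    adjacency = {g: set() for g in genes}
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            if abs(pearson(expression[g], expression[h])) >= threshold:
                adjacency[g].add(h)
                adjacency[h].add(g)
    return adjacency

def degree_centrality(adjacency):
    """Fraction of the other nodes that each node is directly linked to."""
    n = len(adjacency)
    if n < 2:
        return {g: 0.0 for g in adjacency}
    return {g: len(neighbours) / (n - 1) for g, neighbours in adjacency.items()}

def eigenvector_centrality(adjacency, iterations=200):
    """Power iteration on A + I (same leading eigenvector as A, no oscillation)."""
    score = {g: 1.0 for g in adjacency}
    for _ in range(iterations):
        updated = {g: score[g] + sum(score[h] for h in adjacency[g])
                   for g in adjacency}
        norm = math.sqrt(sum(v * v for v in updated.values()))
        score = {g: v / norm for g, v in updated.items()}
    return score

# Hypothetical expression profiles (gene -> values across five samples).
expression = {
    "G1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "G2": [2.0, 4.0, 6.0, 8.0, 10.5],
    "G3": [5.0, 4.0, 3.0, 2.0, 1.2],
    "G4": [1.0, 3.0, 1.0, 3.0, 1.0],
}
network = coexpression_adjacency(expression)
degree = degree_centrality(network)
eigen = eigenvector_centrality(network)
```

Changing the threshold or the similarity measure changes the adjacency structure and therefore the rankings, which is one concrete way in which different network representations yield different node importance results.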
A central goal of gene expression studies coupled with drug response screens is to identify predictive profiles that can be exploited to stratify patients. Numerous methods have been proposed towards this end, most focusing on novel statistical methods and model selection techniques which attempt to uncover groups of genes whose expression profiles are directly and robustly correlated with drug response. However, biological systems process information through the crosstalk of multiple signaling networks, whose ultimate phenotypic consequences may only be determined by the combined input of relevant interacting systems. By restricting predictive signatures to direct gene-drug correlations, biologically meaningful interactions that may serve as superior predictors are ignored. Here we demonstrate that predictive signatures which incorporate the interaction between background gene expression patterns and individual predictive probes can provide models superior to those that directly relate gene expression levels to pharmacological response, and thus should be more widely utilized in pharmacogenetic studies.
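The contrast between a direct gene-drug model and one that includes an interaction with background expression can be sketched with ordinary least squares. The abstract does not specify the model form; the linear model with a product term, the helper names, and the synthetic data (constructed so that response depends only on the product of probe and background) are assumptions for illustration.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("design matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - tail) / aug[r][r]
    return solution

def fit_ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)

def residual_sum_of_squares(rows, y, coef):
    return sum((yi - sum(c * x for c, x in zip(coef, r))) ** 2
               for r, yi in zip(rows, y))

def direct_design(probe):
    return [[1.0, g] for g in probe]

def interaction_design(probe, background):
    # Columns: intercept, probe, background, probe x background.
    return [[1.0, g, b, g * b] for g, b in zip(probe, background)]

# Synthetic data: response = 1 + 2 * probe * background (hypothetical).
probe = [0.0, 1.0, 2.0, 3.0, 0.0, 1.0, 2.0, 3.0]
background = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
response = [1.0 + 2.0 * g * b for g, b in zip(probe, background)]

direct_rows = direct_design(probe)
inter_rows = interaction_design(probe, background)
rss_direct = residual_sum_of_squares(direct_rows, response,
                                     fit_ols(direct_rows, response))
rss_interaction = residual_sum_of_squares(inter_rows, response,
                                          fit_ols(inter_rows, response))
```

In this constructed case the direct model leaves substantial unexplained variation while the interaction model fits exactly. Real data would need cross-validation to compare the two fairly, since the larger model always fits the training data at least as well.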
Body mass index (BMI) is a well-known measure of obesity with a multitude of genetic and non-genetic determinants. Identifying the underlying factors associated with BMI is difficult because of its multifactorial etiology that varies as a function of geoethnic background and socioeconomic setting. Thus, we pursued a study exploring the influence of the degree of Native American admixture on BMI (as well as weight and height individually) in a community sample of Native Americans (n=846) while accommodating a variety of socioeconomic and cultural factors.
Participants’ degree of Native American (NA) ancestry was estimated using a genome-wide panel of markers. The participants also completed an extensive survey of cultural and social identity measures: the Indian Culture Scale (ICS) and the Orthogonal Cultural Identification Scale (OCIS). Multiple linear regression was used to examine the relation between these measures and BMI.
Our results suggest that BMI is correlated positively with the proportion of NA ancestry. Age was also significantly associated with BMI, while gender and socioeconomic measures (education and income) were not. For the two cultural identity measures, the ICS showed a positive correlation with BMI, while OCIS was not associated with BMI.
Taken together, these results suggest that genetic and cultural environmental factors, rather than socioeconomic factors, account for a substantial proportion of variation in BMI in this population. Further, significant correlations between degree of NA ancestry and BMI suggest that admixture mapping may be appropriate to identify loci associated with BMI in this population.
Genetic ancestry; admixture; body habitus; obesity; Native Americans
There is considerable debate about the most efficient way to interrogate rare coding variants in association studies. The options include direct genotyping of specific known coding variants in genes or, alternatively, sequencing across the entire exome to capture known as well as novel variants. Each strategy has advantages and disadvantages, but the availability of cost-efficient exome arrays has made the former appealing. Here we consider the utility of a direct genotyping chip, the Illumina HumanExome array (HE), by evaluating its content based on: 1. functionality; and 2. amenability to imputation. We explored these issues by genotyping a large, ethnically diverse cohort on the HumanOmniExpressExome array (HOEE) which combines the HE with content from the GWAS array (HOE). We find that the use of the HE is likely to be a cost-effective way of expanding GWAS, but does have some drawbacks that deserve consideration when planning studies.
Illumina HumanExome array; expanding GWAS; genotyping rare SNPs; coding variants
Dyslexia and language impairment (LI) are complex traits with substantial genetic components. We recently completed an association scan of the DYX2 locus, where we observed associations of markers in DCDC2, KIAA0319, ACOT13, and FAM65B with reading-, language-, and IQ-related traits. Additionally, the effects of reading-associated DYX3 markers were recently characterized using structural neuroimaging techniques. Here, we assessed the neuroimaging implications of associated DYX2 and DYX3 markers, using cortical volume, cortical thickness, and fractional anisotropy. To accomplish this, we examined eight DYX2 and three DYX3 markers in 332 subjects in the Pediatrics Imaging Neurocognition Genetics study. Imaging-genetic associations were examined by multiple linear regression, testing for influence of genotype on neuroimaging. Markers in DYX2 genes KIAA0319 and FAM65B were associated with cortical thickness in the left orbitofrontal region and global fractional anisotropy, respectively. KIAA0319 and ACOT13 were suggestively associated with overall fractional anisotropy and left pars opercularis cortical thickness, respectively. DYX3 markers showed suggestive associations with cortical thickness and volume measures in temporal regions. Notably, we did not replicate association of DYX3 markers with hippocampal measures. In summary, we performed a neuroimaging follow-up of reading-, language-, and IQ-associated DYX2 and DYX3 markers. DYX2 associations with cortical thickness may reflect variations in their role in neuronal migration. Furthermore, our findings complement gene expression and imaging studies implicating DYX3 markers in temporal regions. These studies offer insight into where and how DYX2 and DYX3 risk variants may influence neuroimaging traits. Future studies should further connect the pathways to risk variants associated with neuroimaging/neurocognitive outcomes.
Dyslexia; Language impairment; KIAA0319; DYX3; DYX2; Imaging-genetics
There is concern that the stresses of inducing pluripotency may lead to deleterious DNA mutations in induced pluripotent stem cell (iPSC) lines, which would compromise their use for cell therapies. Here we report comparative genomic analysis of nine isogenic iPSC lines generated using three reprogramming methods: integrating retroviral vectors, non-integrating Sendai virus and synthetic mRNAs. We used whole-genome sequencing and de novo genome mapping to identify single-nucleotide variants, insertions and deletions, and structural variants. Our results show a moderate number of variants in the iPSCs that were not evident in the parental fibroblasts, which may result from reprogramming. There were only small differences in the total numbers and types of variants among different reprogramming methods. Most importantly, a thorough genomic analysis showed that the variants were generally benign. We conclude that the process of reprogramming is unlikely to introduce variants that would make the cells inappropriate for therapy.
It is feared that reprogramming may introduce DNA mutations. Here Bhutani et al. compare three different reprogramming methods and, using comparative whole-genome analyses, identify nucleotide variations that differ between reprogrammed cells and the original fibroblasts, none of which conveys oncogenic potential.
Low-density lipoprotein cholesterol (LDL-C), high-density lipoprotein cholesterol (HDL-C), and triglycerides (TG) are modifiable risk factors for cardiovascular disease. Several genetic loci for predisposition to abnormal LDL-C, HDL-C and TG have been identified. However, it remains unclear whether these loci are consistently associated with serum lipid levels at each age or with unique developmental trajectories. Therefore, we assessed the association between polygenic genetic risk scores derived from genome-wide association studies (GWAS) and LDL-C, HDL-C, and triglyceride trajectories from childhood to adulthood using data available from the 27-year European ‘Cardiovascular Risk in Young Finns’ Study. For 2,442 participants, three weighted genetic risk scores (wGRSs) for HDL-C (38 SNPs), LDL-C (14 SNPs) and triglycerides (24 SNPs) were computed and tested for association with serum lipoprotein levels measured up to 8 times between 1980 and 2011. The categorical analyses revealed no clear divergence of blood lipid trajectories over time between wGRS categories, with participants in the lower wGRS quartiles tending to have average lipoprotein concentrations 30 to 45% lower than those in the upper wGRS quartile beginning at age 3 years and continuing through to age 49 years (where the upper-quartile wGRS group has 4–7 more risk alleles than the lower wGRS group). Continuous analyses, however, revealed a significant but moderate time-dependent genetic interaction for HDL-C levels, with the association between HDL-C and the continuous HDL-C risk score weakening slightly with age. Conversely, in males, the association between the continuous TG genetic risk score and triglyceride levels tended to be lower in childhood and become more pronounced after the age of 25 years.
Although the influence of genetic factors on age-specific lipoprotein values and developmental trajectories is complex, our data show that wGRSs are highly predictive of HDL-C, LDL-C, and triglyceride levels at all ages.
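A weighted genetic risk score of the kind described above is conventionally a weighted sum of risk-allele counts. The sketch below assumes that standard form; the abstract does not give the weights or the exact scoring formula, so the function name `weighted_grs`, the four-SNP panel and the weights are hypothetical.

```python
def weighted_grs(allele_counts, weights):
    """Weighted genetic risk score: sum of weight * risk-allele count per SNP.

    allele_counts: risk-allele dosages (0, 1 or 2) for one participant.
    weights: per-SNP effect sizes, typically taken from a published GWAS.
    """
    if len(allele_counts) != len(weights):
        raise ValueError("one weight is required per SNP")
    return sum(w * c for c, w in zip(allele_counts, weights))

# Hypothetical four-SNP panel (not the 38-, 14- or 24-SNP scores in the study).
snp_weights = [0.10, 0.05, 0.20, 0.15]
participants = {
    "P1": [0, 1, 2, 0],
    "P2": [2, 2, 1, 1],
    "P3": [0, 0, 0, 1],
}
scores = {pid: weighted_grs(counts, snp_weights)
          for pid, counts in participants.items()}
ranked = sorted(scores, key=scores.get)  # lowest to highest genetic risk
```

Ranking participants by score and cutting the ranking at quartile boundaries gives the categorical groups, while using the score itself as a covariate corresponds to the continuous analyses.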
Advances in DNA sequencing technologies have made it possible to rapidly, accurately and affordably sequence entire individual human genomes. As impressive as this ability seems, however, it is not likely to amount to much if one cannot extract meaningful information from individual sequence data. Annotating variations within individual genomes and providing information about their biological or phenotypic impact will thus be crucially important in moving individual sequencing projects forward, especially in the context of the clinical use of sequence information. In this paper we consider the various ways in which one might annotate individual sequence variations and point out limitations in the available methods for doing so. It is arguable that, in the foreseeable future, DNA sequencing of individual genomes will become routine for clinical, research, forensic, and personal purposes. We therefore also consider directions and areas for further research in annotating genomic variants.
Sequencing; functional analysis; computer modeling; genomic variation
Higher rates of alcohol use and other drug-dependence have been observed in some Native American populations relative to other ethnic groups in the U.S. Previous studies have shown that alcohol dehydrogenase (ADH) genes and aldehyde dehydrogenase (ALDH) genes may affect the risk of development of alcohol dependence, and that polymorphisms within these genes may differentially affect risk for the disorder depending on the ethnic group evaluated. We evaluated variations in the ADH and ALDH genes in a large study investigating risk factors for substance use in a Native American population. We assessed ancestry admixture and tested for associations between alcohol-related phenotypes and variants in the genomic regions around the ADH1-7, ALDH2 and ALDH1A1 genes. Seventy-two (72) ADH variants showed significant evidence of association with a phenotype measuring the severity of alcohol drinking-related dependence symptoms. These significant variants spanned the entire cluster of 7 ADH genes. Two significant associations, one in ADH and one in ALDH2, were observed with alcohol dependence diagnosis. Seventeen (17) variants showed significant association with the largest number of alcohol drinks ingested during any 24-hour period. Variants in or near ADH7 were significantly negatively associated with alcohol-related phenotypes, suggesting a potential protective effect of this gene. In addition, our results suggested that a higher degree of Native American ancestry is associated with higher frequencies of potential risk variants and lower frequencies of potential protective variants for alcohol dependence phenotypes.
Alcoholism; Low coverage sequencing; Admixture; Alcohol metabolizing enzymes
Socioeconomic disparities are associated with differences in cognitive development. The extent to which this translates to disparities in brain structure is unclear. Here, we investigated relationships between socioeconomic factors and brain morphometry, independently of genetic ancestry, among a cohort of 1099 typically developing individuals between 3 and 20 years. Income was logarithmically associated with brain surface area. Specifically, among children from lower income families, small differences in income were associated with relatively large differences in surface area, whereas, among children from higher income families, similar income increments were associated with smaller differences in surface area. These relationships were most prominent in regions supporting language, reading, executive functions and spatial skills; surface area mediated socioeconomic differences in certain neurocognitive abilities. These data indicate that income relates most strongly to brain structure among the most disadvantaged children. Potential implications are discussed.
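The practical meaning of a logarithmic association can be shown with a short calculation. The sketch assumes a model of the form surface area = a + b × ln(income); the slope of 1.0 and the income figures are hypothetical and are not estimates from the study.

```python
import math

def predicted_difference(income, increment, slope=1.0):
    """Change in the predicted outcome when income rises by `increment`.

    Under outcome = a + slope * ln(income), the intercept cancels and the
    change depends only on the ratio (income + increment) / income.
    """
    return slope * (math.log(income + increment) - math.log(income))

# The same hypothetical 10,000 increment at two different starting incomes.
low_income_change = predicted_difference(income=20_000, increment=10_000)
high_income_change = predicted_difference(income=100_000, increment=10_000)
ratio = low_income_change / high_income_change
```

Because the predicted change depends on the proportional rather than the absolute increase, an identical increment corresponds to a larger predicted difference at the lower starting income, matching the pattern the abstract reports.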
Dental caries remains a significant public health problem and is considered pandemic worldwide. The prediction of dental caries based on profiling of microbial species involved in disease and, equally important, the identification of species conferring dental health have proven more difficult than anticipated due to high interpersonal and geographical variability of dental plaque microbiota. We have used RNA-Seq to perform global gene expression analysis of dental plaque microbiota derived from 19 twin pairs that were either concordant (caries-active or caries-free) or discordant for dental caries. The transcription profiling allowed us to define a functional core microbiota consisting of nearly 60 species. Similarities in gene expression patterns allowed a preliminary assessment of the relative contribution of human genetics, environmental factors and caries phenotype on the microbiota's transcriptome. Correlation analysis of transcription allowed the identification of numerous functional networks, suggesting that inter-personal environmental variables may co-select for groups of genera and species. Analysis of functional role categories allowed the identification of dominant functions expressed by dental plaque biofilm communities that highlight the biochemical priorities of dental plaque microbes to metabolize diverse sugars and cope with the acid and oxidative stress resulting from sugar fermentation. The wealth of data generated by deep sequencing of expressed transcripts enables a greatly expanded perspective concerning the functional expression of dental plaque microbiota.
caries; oral microbiota; dental plaque; biofilm; transcriptome
Next-generation sequencing (NGS) technologies have become much more efficient, allowing whole human genomes to be sequenced faster and more cheaply than ever before. However, processing the raw sequence reads produced by NGS technologies requires care and sophistication in order to draw compelling inferences about the phenotypic consequences of variation in human genomes. It has been shown that different approaches to variant calling from NGS data can lead to different conclusions. Ensuring appropriate accuracy and quality in variant calling can come at a computational cost.
We describe our experience implementing and evaluating a group-based approach to calling variants on large numbers of whole human genomes. We explore the influence of many factors that may affect the accuracy and efficiency of group-based variant calling, including group size, the biogeographical backgrounds of the individuals who have been sequenced, and the computing environment used. We make efficient use of the Gordon supercomputer cluster at the San Diego Supercomputer Center by incorporating job-packing and parallelization considerations into our workflow while calling variants on 437 whole human genomes generated as part of a large association study.
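As a rough illustration of the grouping and job-packing ideas mentioned above, the following sketch splits a list of genomes into fixed-size groups and assigns the resulting jobs to compute nodes with a simple greedy heuristic. It is not the authors' workflow: the group size of 50, the four nodes, the sample names, and the use of group size as a cost proxy are all assumptions made for the example.

```python
import heapq


def make_groups(sample_ids, group_size):
    """Split samples into consecutive groups of at most `group_size`."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    return [sample_ids[i:i + group_size]
            for i in range(0, len(sample_ids), group_size)]


def pack_jobs(job_costs, n_nodes):
    """Greedy packing: take jobs from most to least costly and give each
    one to the node with the smallest load so far."""
    heap = [(0.0, node) for node in range(n_nodes)]
    heapq.heapify(heap)
    assignment = {node: [] for node in range(n_nodes)}
    for job, cost in sorted(job_costs.items(), key=lambda kv: kv[1], reverse=True):
        load, node = heapq.heappop(heap)
        assignment[node].append(job)
        heapq.heappush(heap, (load + cost, node))
    return assignment


# Hypothetical sample names; only the count (437) comes from the text.
samples = [f"sample_{i:03d}" for i in range(437)]

# Hypothetical group size; the text treats group size as a factor to explore.
groups = make_groups(samples, 50)

# Assumed cost proxy: number of genomes in each group.
costs = {idx: len(group) for idx, group in enumerate(groups)}

# Hypothetical node count.
plan = pack_jobs(costs, n_nodes=4)

for node, jobs in plan.items():
    print(f"node {node}: groups {jobs}, load {sum(costs[j] for j in jobs)}")
```

In practice the cost of a variant-calling job depends on more than the number of genomes in the group, so a real scheduler would use measured run times or resource requests instead of this proxy.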
We ultimately find that our workflow resulted in high-quality variant calls in a computationally efficient manner. We argue that studies like ours should motivate further investigations combining hardware-oriented advances in computing systems with algorithmic developments to tackle emerging ‘big data’ problems in biomedical research brought on by the expansion of NGS technologies.
Electronic supplementary material
The online version of this article (doi:10.1186/s12859-015-0736-4) contains supplementary material, which is available to authorized users.
Variant calling; Supercomputing; Whole-genome sequencing
The limitations of genome-wide association (GWA) studies that focus on the phenotypic influence of common genetic variants have motivated human geneticists to consider the contribution of rare variants to phenotypic expression. The increasing availability of high-throughput sequencing technology has enabled studies of rare variants, but will not be sufficient for their success since appropriate analytical methods are also needed. We consider data analysis approaches to testing associations between a phenotype and collections of rare variants in a defined genomic region or set of regions. Ultimately, although a wide variety of analytical approaches exist, more work is needed to refine them and determine their properties and power in different contexts.
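One widely described family of approaches to testing a collection of rare variants is the collapsing or "burden" test, in which the rare alleles an individual carries in a region are summed into a single score. The sketch below is a minimal illustration of that idea with a permutation test on toy data; it is not presented as any specific method discussed by the authors, and the genotypes, phenotypes, and function names are hypothetical.

```python
import random


def burden_scores(genotypes):
    """Collapse each individual's rare-variant genotypes into one count."""
    return [sum(row) for row in genotypes]


def mean_diff(scores, phenotype):
    """Mean burden in cases (1) minus mean burden in controls (0)."""
    cases = [s for s, y in zip(scores, phenotype) if y == 1]
    controls = [s for s, y in zip(scores, phenotype) if y == 0]
    return sum(cases) / len(cases) - sum(controls) / len(controls)


def permutation_p_value(scores, phenotype, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the difference in mean burden."""
    rng = random.Random(seed)
    observed = abs(mean_diff(scores, phenotype))
    labels = list(phenotype)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # break any link between burden and phenotype
        if abs(mean_diff(scores, labels)) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (hits + 1) / (n_perm + 1)


# Toy data (hypothetical): rows are individuals, columns are rare variants
# in one region, entries are minor-allele counts.
genotypes = [
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
phenotype = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = case, 0 = control

scores = burden_scores(genotypes)
observed_diff = mean_diff(scores, phenotype)
p_value = permutation_p_value(scores, phenotype)

print("burden scores:", scores)
print("observed difference in mean burden:", observed_diff)
print("permutation p-value:", p_value)
```

A simple sum assumes that all variants in the region act in the same direction with similar effect sizes; when that assumption fails, the power of such a test can drop, which is one reason the properties of these methods need to be examined in different contexts.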
Insulin-like growth factors (IGF) 1 and 2 are known as potential mitogens for normal and neoplastic cells. IGF2 is a main fetal growth factor, while IGF1 is activated through growth hormone action during postnatal growth and development. However, there is strong evidence that activation of IGF2 by E2F transcription factor 3 (E2F3) occurs in different types of cancer. High levels of IGF1 also correlate strongly with cancer development, owing to its anti-apoptotic properties and enhancement of cancer cell differentiation, which can be attenuated by IGFBP3. Head and neck cancer is one of the six most common human cancers. The main risk factors for head and neck cancer are consumption of tobacco and alcohol, as well as viral and bacterial infection acting through stimulation of chronic local inflammation. There is also a genetic basis for this form of cancer; however, the genetic markers are not yet established. In this study we investigated the expression levels of IGF2, IGF1, E2F3 and IGFBP3 in human cancer tissue and in healthy tissue surrounding the tumor, obtained from each of 41 patients. Our study indicated no alteration in the expression levels of IGF2, E2F3 and IGF1 in the head and neck squamous cell carcinoma (HNSCC) cases studied in the selected experimental population, but there was evidence for upregulation of pro-apoptotic IGFBP3 in cancer compared with healthy tissue. These findings indicate that insulin-like growth factors are not directly associated with HNSCC, although expression showed some variability between patients and tumor locations. However, the elevated level of IGFBP3 suggests a possible regulatory role for this binding protein in IGF signaling in this type of tumor.
oral cancer; IGF1; IGF2; survival