Large-scale statistical analyses have become hallmarks of post-genomic era biological research due to advances in high-throughput assays and the integration of large biological databases. One accompanying issue is the simultaneous estimation of p-values for a large number of hypothesis tests. In many applications, a parametric assumption in the null distribution such as normality may be unreasonable, and resampling-based p-values are the preferred procedure for establishing statistical significance. Using resampling-based procedures for multiple testing is computationally intensive and typically requires large numbers of resamples.
We present a new approach to more efficiently assign resamples (such as bootstrap samples or permutations) within a nonparametric multiple testing framework. We formulated a Bayesian-inspired approach to this problem, and devised an algorithm that adapts the assignment of resamples iteratively with negligible space and running time overhead. In two experimental studies, a breast cancer microarray dataset and a genome-wide association study dataset for Parkinson's disease, we demonstrated that our differential allocation procedure is substantially more accurate than the traditional uniform resample allocation.
Our experiments demonstrate that using a more sophisticated allocation strategy can improve our inference for hypothesis testing without a drastic increase in the amount of computation on randomized data. Moreover, we gain more improvement in efficiency when the number of tests is large. R code for our algorithm and the shortcut method are available at .
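The contrast between uniform and differential allocation can be illustrated with a minimal sketch. The code below is not the Bayesian-inspired algorithm described above; it is a simple two-stage heuristic, assumed here purely for illustration, in which every test receives a pilot batch of permutations and the remaining budget goes to whichever test has a running p-value estimate that is least resolved relative to the significance threshold. The names `count_exceedances` and `allocate_resamples`, the mean-difference statistic, and all parameters are hypothetical.

```python
import random

def mean_difference(x, y):
    """Absolute difference in group means, used as the test statistic."""
    return abs(sum(x) / len(x) - sum(y) / len(y))

def count_exceedances(x, y, n_perm, rng):
    """Count label permutations whose statistic is >= the observed one."""
    observed = mean_difference(x, y)
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if mean_difference(pooled[:len(x)], pooled[len(x):]) >= observed:
            hits += 1
    return hits

def allocate_resamples(tests, budget, pilot=50, batch=50, alpha=0.05, seed=0):
    """Two-stage heuristic (an assumption, not the published method).

    tests  : non-empty list of (x, y) two-group samples
    budget : total number of permutations available across all tests
    """
    if budget < pilot * len(tests):
        raise ValueError("budget too small for the pilot stage")
    rng = random.Random(seed)
    # Stage 1: every test gets the same pilot batch (uniform allocation).
    hits = [count_exceedances(x, y, pilot, rng) for x, y in tests]
    used = [pilot] * len(tests)

    def p_hat(i):
        # Add-one estimate, so the p-value is never exactly zero.
        return (hits[i] + 1) / (used[i] + 1)

    def resolution(i):
        # Distance from alpha in standard-error units; small means undecided.
        p = p_hat(i)
        se = max((p * (1 - p) / used[i]) ** 0.5, 1e-12)
        return abs(p - alpha) / se

    # Stage 2: spend the rest, one batch at a time, on the least resolved test.
    remaining = budget - sum(used)
    while remaining >= batch:
        i = min(range(len(tests)), key=resolution)
        x, y = tests[i]
        hits[i] += count_exceedances(x, y, batch, rng)
        used[i] += batch
        remaining -= batch
    return [p_hat(i) for i in range(len(tests))], used
```

Under this heuristic, a test whose estimate is clearly above or clearly below the threshold stops receiving resamples early, which is the intuition behind allocating resamples differentially rather than uniformly.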
Genetic mapping has been used as a tool to study the genetic architecture of complex traits by localizing their underlying quantitative trait loci (QTLs). Statistical methods for genetic mapping rely on a key assumption, that is, traits obey a parametric distribution. However, in practice real data may not perfectly follow the specified distribution.
Here, we derive a robust statistical approach for QTL mapping that accommodates a certain degree of misspecification of the true model by incorporating integrated square errors into the genetic mapping framework. A hypothesis test is formulated by defining a new test statistic, the energy difference.
Simulation studies were performed to investigate the statistical properties of this approach and compare these properties with those from traditional maximum likelihood and non-parametric QTL mapping approaches. Lastly, analyses of real examples were conducted to demonstrate the usefulness and utilization of the new approach in a practical genetic setting.
The resampling-based test, which often relies on permutation or bootstrap procedures, has been widely used for statistical hypothesis testing when the asymptotic distribution of the test statistic is unavailable or unreliable. It requires repeated calculations of the test statistic on a large number of simulated data sets to assess its significance level, and thus it can become very computationally intensive. Here, we propose an efficient p-value evaluation procedure by adapting the stochastic approximation Markov chain Monte Carlo algorithm. The new procedure can be used easily for estimating the p-value for any resampling-based test. We show through numerical simulations that the proposed procedure can be 100–500 000 times as efficient (in terms of computing time) as the standard resampling-based procedure when evaluating a test statistic with a small p-value (e.g. less than 10^-6). With its computational burden reduced by this proposed procedure, the versatile resampling-based test would become computationally feasible for a much wider range of applications. We demonstrate the application of the new method by applying it to a large-scale genetic association study of prostate cancer.
Bootstrap procedures; Genetic association studies; p-value; Resampling-based tests; Stochastic approximation Markov chain Monte Carlo
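To make the computational burden of the standard procedure concrete, the following minimal sketch shows the usual Monte Carlo p-value estimate and its resolution limit. It illustrates only the standard resampling-based estimator, not the stochastic approximation Markov chain Monte Carlo procedure; the function names are hypothetical.

```python
def empirical_p_value(observed, null_stats):
    """Standard resampling estimate with the add-one correction.

    observed   : test statistic computed on the real data
    null_stats : statistics computed on B resampled data sets
    """
    exceed = sum(1 for s in null_stats if s >= observed)
    return (exceed + 1) / (len(null_stats) + 1)

def smallest_attainable_p(n_resamples):
    """Lower bound on the estimate when no resample exceeds the observed value."""
    return 1 / (n_resamples + 1)
```

With B resamples the estimate can never fall below 1/(B + 1), so resolving a p-value below 10^-6 with this estimator requires on the order of a million evaluations of the test statistic per test.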
Long-range chromosomal associations between genomic regions, and their repositioning in the 3D space of the nucleus, are now considered to be key contributors to the regulation of gene expression, and important links have been highlighted with other genomic features involved in DNA rearrangements. Recent Chromosome Conformation Capture (3C) measurements performed with high-throughput sequencing (Hi-C) and molecular dynamics studies show that there is a strong correlation between colocalization and coregulation of genes, but this important research is hampered by the lack of biologist-friendly analysis and visualisation software. Here, we describe NuChart, an R package that allows the user to annotate and statistically analyse a list of input genes with information relying on Hi-C data, integrating knowledge about genomic features that are involved in the chromosome spatial organization. NuChart works directly with sequenced reads to identify the related Hi-C fragments, with the aim of creating gene-centric neighbourhood graphs on which multi-omics features can be mapped. Predictions about CTCF binding sites, isochores and cryptic Recombination Signal Sequences are provided directly with the package for mapping, although other annotation data in bed format can be used (such as methylation profiles and histone patterns). Gene expression data can be automatically retrieved and processed from the Gene Expression Omnibus and ArrayExpress repositories to highlight the expression profile of genes in the identified neighbourhood. Moreover, statistical inferences about the graph structure and correlations between its topology and multi-omics features can be performed using Exponential-family Random Graph Models. The Hi-C fragment visualisation provided by NuChart allows the comparison of cells in different conditions, thus making it possible to identify novel biomarkers.
NuChart is compliant with the Bioconductor standard and it is freely available at ftp://fileserver.itb.cnr.it/nuchart.
In eukaryotes, most DNA-binding proteins exert their action as members of large effector complexes. The presence of these complexes is revealed in high-throughput genome-wide assays by the co-occurrence of the binding sites of different complex components. Resampling tests are one route by which the statistical significance of apparent co-occurrence can be assessed.
We have investigated two resampling approaches for evaluating the statistical significance of binding-site co-occurrence. The permutation test approach was found to yield overly favourable p-values, while the independent resampling approach had the opposite effect and is of little use in practical terms. We have developed a new, pragmatically devised hybrid approach that, when applied to the experimental results of a Polycomb/Trithorax study, yielded p-values consistent with the findings of that study. We extended our investigations to the FL method developed by Haiminen et al., which derives its null distribution from all binding sites within a dataset, and show that the p-value computed for a pair of factors by this method can depend on which other factors are included in that dataset. Both our hybrid method and the FL method appeared to yield plausible estimates of the statistical significance of co-occurrences, although our hybrid method was more conservative when applied to the Polycomb/Trithorax dataset.
A high-performance parallelized implementation of the hybrid method is available.
We propose a new resampling-based co-occurrence significance test and demonstrate that it performs as well as or better than existing methods on a large experimentally-derived dataset. We believe it can be usefully applied to data from high-throughput genome-wide techniques such as ChIP-chip or DamID. The Cooccur package, which implements our approach, accompanies this paper.
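As a generic illustration of how a resampling test for binding-site co-occurrence can be set up, the sketch below counts sites of one factor that lie within a fixed window of a site of another factor, and compares that count with counts obtained after relocating the second factor's sites uniformly at random. This is neither the hybrid method nor the FL method, and it is not the Cooccur package's interface; the uniform-relocation null model, the single linear coordinate system, and all names and parameters are assumptions made for the example.

```python
import bisect
import random

def count_cooccurring(sites_a, sites_b, window):
    """Number of A sites with at least one B site within `window` bp."""
    b_sorted = sorted(sites_b)
    n = 0
    for a in sites_a:
        # First B site at or beyond the left edge of the window around a.
        i = bisect.bisect_left(b_sorted, a - window)
        if i < len(b_sorted) and b_sorted[i] <= a + window:
            n += 1
    return n

def cooccurrence_p_value(sites_a, sites_b, genome_len, window,
                         n_resamples=999, seed=0):
    """Empirical p-value under a uniform-relocation null (an assumption)."""
    rng = random.Random(seed)
    observed = count_cooccurring(sites_a, sites_b, window)
    exceed = 0
    for _ in range(n_resamples):
        # Relocate every B site independently and uniformly along the genome.
        random_b = [rng.randrange(genome_len) for _ in sites_b]
        if count_cooccurring(sites_a, random_b, window) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_resamples + 1)
```

The choice of null model drives the result: as the abstract notes, different resampling schemes can give p-values that are too favourable or too conservative, so a uniform null such as this one should be read as a baseline only.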
In analyzing the stability of DNA replication origins in Saccharomyces cerevisiae we faced the question whether one set of sequences is significantly enriched in the number and/or the quality of the matches of a particular position weight matrix relative to another set.
We present SADMAMA, a computational solution to address this problem. SADMAMA implements two types of statistical tests to answer this question: one type is based on simplified models, while the other relies on bootstrapping, and as such might be preferable to users who are averse to such models. The bootstrap approach incorporates a novel "site-protected" resampling procedure which solves a problem we identify with naive resampling.
SADMAMA's utility is demonstrated here by offering a plausible explanation for the differential ARS activity observed in our previous mcm1-1 mutant experiments, by suggesting the relevance of multiple weak ACS matches to efficient replication origin function in Saccharomyces cerevisiae, and by suggesting an explanation for the observed negative effect FKH2 has on chromatin silencing. SADMAMA is available for download from .
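The quantities being compared between two sequence sets, the number and the quality of position weight matrix matches, can be sketched as follows. This is an illustrative log-odds scoring scheme with a uniform background and a pseudocount, both of which are assumptions; it is not SADMAMA's implementation, it scans the forward strand only, and it does not attempt the "site-protected" resampling procedure.

```python
import math

def log_odds_pwm(counts, background=0.25, pseudocount=1.0):
    """Convert per-position base counts into a log2-odds weight matrix.

    counts : list of dicts mapping 'A', 'C', 'G', 'T' to observed counts
    """
    pwm = []
    for col in counts:
        total = sum(col.get(b, 0) for b in "ACGT") + 4 * pseudocount
        pwm.append({
            b: math.log2(((col.get(b, 0) + pseudocount) / total) / background)
            for b in "ACGT"
        })
    return pwm

def best_match_score(seq, pwm):
    """Highest forward-strand score over all windows (needs len(seq) >= len(pwm))."""
    w = len(pwm)
    return max(
        sum(pwm[j][seq[i + j]] for j in range(w))
        for i in range(len(seq) - w + 1)
    )

def count_matches(seqs, pwm, threshold):
    """Number of sequences whose best match reaches the threshold."""
    return sum(1 for s in seqs if best_match_score(s, pwm) >= threshold)
```

Comparing `count_matches` (number of matches) or the distribution of `best_match_score` (quality of matches) between two sets gives the kind of statistic whose significance a model-based or bootstrap test would then assess.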
Community detection helps us simplify the complex configuration of networks, but communities are reliable only if they are statistically significant. To detect statistically significant communities, a common approach is to resample the original network and analyze the communities. But resampling assumes independence between samples, while the components of a network are inherently dependent. Therefore, we must understand how breaking dependencies between resampled components affects the results of the significance analysis. Here we use scientific communication as a model system to analyze this effect. Our dataset includes citations among articles published in journals in the years 1984–2010. We compare parametric resampling of citations with non-parametric article resampling. While citation resampling breaks link dependencies, article resampling maintains such dependencies. We find that citation resampling underestimates the variance of link weights. Moreover, this underestimation explains most of the differences in the significance analysis of ranking and clustering. Therefore, when only link weights are available and article resampling is not an option, we suggest a simple parametric resampling scheme that generates link-weight variances close to the link-weight variances of article resampling. Nevertheless, when we highlight and summarize important structural changes in science, the more dependencies we can maintain in the resampling scheme, the earlier we can predict structural change.
Genome-wide investigations for identifying the genes for complex traits are considered to be agnostic in terms of prior assumptions for the responsible DNA alterations. The agreement of genome-wide association studies (GWAS) and genome-wide linkage scans (GWLS) has not been explored to date. In this study, a genomic convergence approach of GWAS and GWLS was implemented for the first time in order to identify genomic loci supported by both methods. A database with 376 GWLS and 102 GWAS for 19 complex traits was created. Data regarding the location and statistical significance for each genetic marker were extracted from articles or web-based databases. Convergence was quantified as the proportion of significant GWAS markers located within linked regions. Convergence was variable (0–73.3%) and was found to be significantly higher than expected by chance for only two of the 19 phenotypes. Seventy-five loci of interest were identified which, being supported by independent lines of evidence, could merit prioritization in future investigations. Although convergence is supportive of genuine effects, lack of agreement between GWLS and GWAS also indicates that these studies are designed to answer different questions and are not equally well suited for deciphering the genetics of complex traits.
genomic convergence; linkage; association; complex traits
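The convergence measure defined above, the proportion of significant GWAS markers located within linked regions, can be written as a short function. The data layout (tuples of chromosome and position, and of chromosome, start and end) and the name `convergence` are assumptions made for illustration.

```python
def convergence(gwas_markers, linked_regions):
    """Proportion of significant GWAS markers that fall inside a linked region.

    gwas_markers   : list of (chromosome, position)
    linked_regions : list of (chromosome, start, end), end inclusive
    """
    if not gwas_markers:
        return 0.0
    inside = sum(
        1
        for chrom, pos in gwas_markers
        if any(c == chrom and start <= pos <= end
               for c, start, end in linked_regions)
    )
    return inside / len(gwas_markers)
```

For example, three markers of which one lies in a linked region on the same chromosome give a convergence of one third.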
Transcriptionally silent regions of the Saccharomyces cerevisiae genome, the silent mating type loci and telomeres, represent the yeast equivalent of metazoan heterochromatin. To gain insight into the nature of silenced chromatin structure, we have examined the topology of DNA spanning the HML silent mating type locus by determining the superhelical density of mini-circles excised from HML (HML circles) by site-specific recombination. We observed that HML circles excised in a wild-type (SIR+) strain were more negatively supercoiled upon deproteinization than were the same circles excised in a sir- strain, in which silencing was abolished, even when HML alleles in which neither circle was transcriptionally competent were used. cis-acting sites flanking HML, called silencers, are required in the chromosome for establishment and inheritance of silencing. HML circles excised without silencers from cells arrested at any point in the cell cycle retained SIR-dependent differences in superhelical density. However, progression through the cell cycle converted SIR+ HML circles to a form resembling that of circles from sir- cells. This decay was not observed with circles carrying a silencer. These results establish that (i) DNA in transcriptionally silenced chromatin assumes a distinct topology reflecting a distinct organization of silenced versus active chromatin; (ii) the altered chromatin structure in silenced regions likely results from changes in packaging of individual nucleosomes, rather than changes in nucleosome density; and (iii) cell cycle progression disrupts the silenced chromatin structure, a process that is counteracted by silencers.
Significance of genetic association to a marker has been traditionally evaluated through statistics that are standardized such that their null distributions conform to some known ones. Distributional assumptions are often required in this standardization procedure. Based on the observation that the phenotype remains the same regardless of the marker being investigated, we propose a simple statistic that does not need such standardization. We propose a resampling procedure to assess this statistic’s genome-wide significance. This method has been applied to replicate 2 of the Genetic Analysis Workshop 17 simulated data on unrelated individuals in an attempt to map phenotype Q2. However, none of the selected SNPs are in genes that are disease-causing. This may be due to the weak effect that each genetic factor has on Q2.
Much forensic inference based upon DNA evidence is made assuming Hardy-Weinberg Equilibrium (HWE) for the genetic loci being used. Several statistical tests to detect and measure deviation from HWE have been devised, and their limitations become more obvious when testing for deviation within multiallelic DNA loci. The most popular methods, the Chi-square and Likelihood-ratio tests, are based on asymptotic results and cannot guarantee good performance in the presence of low-frequency genotypes. Since the dimension of the parameter space grows quadratically with the number of alleles, some authors suggest applying sequential methods, where the multiallelic case is reformulated as a sequence of “biallelic” tests. However, in this approach it is not obvious how to assess the general evidence of the original hypothesis; nor is it clear how to establish the significance level for its acceptance/rejection. In this work, we introduce a straightforward method for the multiallelic HWE test, which overcomes the aforementioned issues of sequential methods. The core theory for the proposed method is given by the Full Bayesian Significance Test (FBST), an intuitive Bayesian approach which does not assign positive probabilities to zero measure sets when testing sharp hypotheses. We compare FBST performance to Chi-square, Likelihood-ratio and Markov chain tests, in three numerical experiments. The results suggest that FBST is a robust and high-performance method for the HWE test, even in the presence of several alleles and small sample sizes.
Hardy-Weinberg equilibrium; significance tests; FBST
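For reference, the asymptotic chi-square test that the FBST is compared against can be sketched for the simplest, biallelic case. This is the textbook one-degree-of-freedom test, not the FBST; the function name is hypothetical, and the code assumes both alleles are present so that no expected count is zero.

```python
import math

def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test of Hardy-Weinberg equilibrium at a biallelic locus.

    Returns (statistic, asymptotic p-value with 1 degree of freedom).
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of chi-square with 1 df: P(Z^2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value
```

Because each term divides by an expected count, a rare genotype with a very small expected count can dominate the statistic, which is one way to see why the asymptotic approximation degrades for low-frequency genotypes.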
Tumoral tissues generally exhibit aberrations in DNA copy number that are associated with the development and progression of cancer. Genotyping methods such as array-based comparative genomic hybridization (aCGH) provide means to identify copy number variation across the entire genome. To address some of the shortfalls of existing methods of DNA copy number data analysis, including strong model assumptions, lack of accounting for sampling variability of estimators, and the assumption that clones are independent, we propose a simple graphical approach to assess population-level genetic alterations over the entire genome based on moving average. Furthermore, existing methods primarily focus on segmentation and do not examine the association of covariates with genetic instability. In our methods, covariates are incorporated through a possibly mis-specified working model and sampling variabilities of estimators are approximated using a resampling method that is based on perturbing observed processes. Our proposal, which is applicable to partial, entire or multiple chromosomes, is illustrated through application to aCGH studies of two brain tumor types, meningioma and glioma.
aCGH data; Moving average; Perturbation method; Gaussian process; Genomic data
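The moving-average idea at the core of the graphical approach can be sketched as below. The sketch covers only the smoothing of population-level clone means; the covariate working model and the perturbation-based resampling are not reproduced, and the function names, the centred window, and the truncation at chromosome ends are assumptions.

```python
def clone_means(samples):
    """Across-sample mean log ratio for each clone.

    samples : list of per-sample lists, each ordered by clone position
    """
    return [sum(col) / len(col) for col in zip(*samples)]

def moving_average(values, window):
    """Centred moving average; windows are truncated at both ends."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed
```

Plotting `moving_average(clone_means(samples), window)` against genomic position gives a population-level profile in which sustained departures from zero suggest recurrent gains or losses.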
In the eukaryotic genome, transcriptionally silent chromatin tends to propagate along a chromosome and encroach upon adjacent active chromatin. The silencing machinery can be stopped by chromatin boundary elements. We performed a screen in Saccharomyces cerevisiae for proteins that may contribute to the establishment of a chromatin boundary. We found that disruption of histone deacetylase Rpd3p results in defective boundary activity, leading to a Sir-dependent local propagation of transcriptional repression. In rpd3Δ cells, the amount of Sir2p that was normally found in the nucleolus decreased and the amount of Sir2p found at telomeres and at HM and its adjacent loci increased, leading to an extension of silent chromatin in those areas. In addition, Rpd3p interacted directly with chromatin at boundary regions to deacetylate histone H4 at lysine 5 and at lysine 12. Either the mutation of histone H4 at lysine 5 or a decrease in the histone acetyltransferase (HAT) activity of Esa1p abrogated the silencing phenotype associated with rpd3 mutation, suggesting a novel role for the H4 amino terminus in Rpd3p-mediated heterochromatin boundary regulation. Together, these data provide insight into the molecular mechanisms for the anti-silencing functions of Rpd3p during the formation of heterochromatin boundaries.
The use of current high-throughput genetic, genomic and post-genomic data leads to the simultaneous evaluation of a large number of statistical hypotheses and, consequently, to the multiple-testing problem. As an alternative to the overly conservative Family-Wise Error-Rate (FWER), the False Discovery Rate (FDR) has emerged over the last ten years as a more appropriate way to handle this problem. However, one drawback of the FDR is that it is tied to a given rejection region for the statistics considered, attributing the same value to those that are close to the boundary and those that are not. As a result, the local FDR has recently been proposed to quantify the specific probability that a given null hypothesis is true.
In this context we present a semi-parametric approach based on kernel estimators which is applied to different high-throughput biological data such as patterns in DNA sequences, gene expression and genome-wide association studies.
The proposed method has practical advantages over existing approaches: it can accommodate complex heterogeneity in the alternative hypothesis, take prior information into account (from expert judgment or previous studies) through a semi-supervised mode, and deal with truncated distributions such as those obtained in Monte Carlo simulations. This method has been implemented and is available through the R package kerfdr via the CRAN or at .
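The difference between FWER control and FDR control mentioned above can be seen in a small sketch that applies the Bonferroni correction and the Benjamini-Hochberg step-up procedure to the same p-values. Both are standard procedures chosen here for illustration; neither is the kernel-based local FDR estimator implemented in kerfdr, and the function names are hypothetical.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """FWER control: reject when p <= alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg_reject(p_values, q=0.05):
    """FDR control: reject the k smallest p-values, where k is the largest
    rank such that p_(k) <= k * q / m."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / m:
            k_max = rank
    rejected = [False] * m
    for i in order[:k_max]:
        rejected[i] = True
    return rejected
```

Every hypothesis rejected by the step-up procedure shares the same FDR guarantee, whether its p-value is far below the cut-off or just under it, which is the limitation that motivates a per-hypothesis quantity such as the local FDR.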
The Epstein-Barr virus (EBV) double-stranded DNA genome is subject to extensive epigenetic regulation. Large consortiums and individual labs have generated a vast number of genome-wide data sets on human lymphoblastoid and other cell lines latently infected with EBV. Analysis of these data sets reveals important new information on the properties of the host and viral chromosome structure organization and epigenetic modifications. We discuss the mapping of these data sets and the subsequent insights into the chromatin structure and transcription factor binding patterns on latent EBV genomes. Colocalization of multiple histone modifications and transcription factors at regulatory loci is considered in the context of the biology and regulation of EBV.
Epstein-Barr virus; gammaherpesvirus; chromatin; histone modification; CTCF; OriP
Fluoroquinolones are an important class of antibiotics for the treatment of infections arising from the gram-positive respiratory pathogen Streptococcus pneumoniae. Although there is evidence supporting interspecific lateral DNA transfer of fluoroquinolone target loci, no studies have specifically been designed to assess the role of intraspecific lateral transfer of these genes in the spread of fluoroquinolone resistance. This study involves a comparative evolutionary perspective, in which the evolutionary history of a diverse set of S. pneumoniae clinical isolates is reconstructed from an expanded multilocus sequence typing data set, with putative recombinants excluded. This control history is then assessed against networks of each of the four fluoroquinolone target loci from the same isolates. The results indicate that although the majority of fluoroquinolone target loci from this set of 60 isolates are consistent with a clonal dissemination hypothesis, 3 to 10% of the sequences are consistent with an intraspecific lateral transfer hypothesis. Also evident were examples of interspecific transfer, with two isolates possessing a parE-parC gene region arising from viridans group streptococci. The Spain 23F-1 clone is the most dominant fluoroquinolone-nonsusceptible clone in this set of isolates, and the analysis suggests that its members act as frequent donors of fluoroquinolone-nonsusceptible loci. Although the majority of fluoroquinolone target gene sequences in this set of isolates can be explained on the basis of clonal dissemination, a significant number are more parsimoniously explained by intraspecific lateral DNA transfer, and in situations of high S. pneumoniae population density, such events could be an important means of resistance spread.
Random monoallelic expression contributes to phenotypic variation of cells and organisms. However, the epigenetic mechanisms by which individual alleles are randomly selected for expression are not known. Taking cues from chromatin signatures at imprinted gene loci such as the insulin-like growth factor 2 gene (IGF2), we evaluated the contribution of CTCF, a zinc finger protein required for parent-of-origin-specific expression of the IGF2 gene, as well as a role for allele-specific association with DNA methylation, histone modification and RNA polymerase II.
Using array-based chromatin immunoprecipitation, we identified 293 genomic loci that are associated with both CTCF and histone H3 trimethylated at lysine 9 (H3K9me3). A comparison of their genomic positions with those of previously published monoallelically expressed genes revealed no significant overlap between allele-specifically expressed genes and colocalized CTCF/H3K9me3. To analyze the contributions of CTCF and H3K9me3 to gene regulation in more detail, we focused on the monoallelically expressed IGF2BP1 gene. In vitro binding assays using the CTCF target motif at the IGF2BP1 gene, as well as allele-specific analysis of cytosine methylation and CTCF binding, revealed that CTCF does not regulate mono- or biallelic IGF2BP1 expression. Surprisingly, we found that RNA polymerase II is detected on both the maternal and paternal alleles in B lymphoblasts that express IGF2BP1 primarily from one allele. Thus, allele-specific control of RNA polymerase II elongation regulates the allelic bias of IGF2BP1 gene expression.
Colocalization of CTCF and H3K9me3 does not represent a reliable chromatin signature indicative of monoallelic expression. Moreover, association of individual alleles with both active (H3K4me3) and silent (H3K27me3) chromatin modifications (allelic bivalent chromatin) or with RNA polymerase II also fails to identify monoallelically expressed gene loci. The selection of individual alleles for expression occurs in part during transcription elongation.
In eukaryotic cells, environmental and developmental signals alter chromatin structure and modulate gene expression. Heterochromatin constitutes the transcriptionally inactive state of the genome and in plants and mammals is generally characterized by DNA methylation and histone modifications such as histone H3 lysine 9 (H3K9) methylation. In Arabidopsis thaliana, DNA methylation and H3K9 methylation are usually colocated and set up a mutually self-reinforcing and stable state. Here, in contrast, we found that SUVR5, a plant Su(var)3–9 homolog with a SET histone methyltransferase domain, mediates H3K9me2 deposition and regulates gene expression in a DNA methylation–independent manner. SUVR5 binds DNA through its zinc fingers and represses the expression of a subset of stimulus response genes. This represents a novel mechanism for plants to regulate their chromatin and transcriptional state, which may allow for the adaptability and modulation necessary to rapidly respond to extracellular cues.
The ability of eukaryotic cells to respond to external stimuli depends on the coordinated activation and repression of specific subsets of genes, often relying on chromatin structure modification. Here, we have characterized a locus-specific mechanism to repress gene expression by the action of an Arabidopsis thaliana SET domain protein, SUVR5, the first example of sequence-dependent heterochromatin initiator in the plant kingdom. Our results suggest that SUVR5 establishes the heterochromatic state by H3K9me2 deposition in a DNA methylation–independent manner that is not perpetuated and thus allows for changes in response to the environment or developmental cues.
Papillomaviruses contain small double-stranded DNA genomes that are maintained in persistently infected mammalian host epithelia as nuclear plasmids and rely upon the host replication machinery for replication. Papillomaviruses encode a DNA helicase, E1, which can specifically bind to the viral genome and support DNA synthesis. Under some conditions in mammalian cells, E1 is not required for viral DNA synthesis, leading to the hypothesis that papillomavirus DNA can be replicated solely by the host replication machinery. This machinery is highly conserved among eukaryotes. We and others found that papillomavirus DNA could replicate in a simple eukaryote, Saccharomyces cerevisiae. Specifically, papillomavirus DNA could substitute for the function of the autonomously replicating sequence (ARS) and centromere (CEN) elements that are normally both required for the stable replication of extrachromosomal DNAs in yeast. Furthermore, this form of replication in yeast was E1 independent. In this study, we map the elements in the human papillomavirus type 16 (HPV16) genome that can substitute for yeast ARS and CEN elements. A single element, termed rep, was identified that can substitute for ARS, and multiple elements, termed mtc, could substitute for CEN. The location of one of these mtc elements overlaps the location of rep, and this approximately 1,000-bp region of HPV16 was sufficient to support stable replication of a bacterial-yeast shuttle plasmid deleted of both ARS and CEN elements.
Resampling algorithms provide an empirical, non-parametric approach to determine the statistical significance of annotations in different experimental settings. ResA3 (Resampling Analysis of Arbitrary Annotations, short: ResA) is a novel tool to facilitate the analysis of enrichment and regulation of annotations deposited in various online resources such as KEGG, Gene Ontology and Pfam or any kind of classification. Results are presented in readily accessible navigable table views together with relevant information for statistical inference. The tool is able to analyze multiple types of annotations in a single run and includes a Gene Ontology annotation feature. We successfully tested ResA using a dataset obtained by measuring incorporation rates of stable isotopes into proteins in intact animals. ResA complements existing tools and will help to evaluate the increasing number of large-scale transcriptomics and proteomics datasets (resa.mpi-bn.mpg.de).
Genes for complex disorders have proven hard to find using linkage analysis. The results rarely reach the desired level of significance, and researchers have often failed to replicate positive findings. There is, however, a wealth of information from other scientific approaches which enables the formation of hypotheses on groups of genes or genomic regions likely to be enriched in disease loci. Examples include genes belonging to specific pathways or producing proteins interacting with known risk factors, genes that show altered expression levels in patients or even the group of top scoring locations in a linkage study. We show here that this hypothesis of enrichment for disease loci can be tested using genome-wide linkage data, provided that these data are independent from the data used to generate the hypothesis. Our method is based on the fact that non-parametric linkage analyses are expected to show increased scores at each one of the disease loci, although this increase might not rise above the noise of stochastic variation. By using a summary statistic and calculating its empirical significance, we show that enrichment hypotheses can be tested with power higher than the power of the linkage scan data to identify individual loci. Via simulated linkage scans for a number of different models, we gain insight into the interpretation of genome scan results and test the power of our proposed method. We present an application of the method to real data from a late-onset Alzheimer's disease linkage scan as a proof of principle.
linkage; genome scan; complex disorder; genes; group
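The idea of testing an enrichment hypothesis with a summary statistic and its empirical significance can be sketched as follows. The mean score over the candidate positions is used as the summary statistic and random position sets of the same size form the null distribution; both choices, and the function name, are assumptions for illustration. In particular, drawing positions independently ignores the correlation between neighbouring linkage scores, which a real analysis would need to respect.

```python
import random

def enrichment_p_value(scores, candidate_idx, n_resamples=999, seed=0):
    """Empirical significance of the mean linkage score over candidate loci.

    scores        : linkage score at each genome-wide position
    candidate_idx : indices of the positions named by the hypothesis
    """
    rng = random.Random(seed)
    k = len(candidate_idx)
    observed = sum(scores[i] for i in candidate_idx) / k
    exceed = 0
    for _ in range(n_resamples):
        # Null: a random set of k positions (independence is an assumption).
        sample = rng.sample(range(len(scores)), k)
        if sum(scores[i] for i in sample) / k >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_resamples + 1)
```

Pooling modest increases across many loci is what allows the summary statistic to reach significance even when no single locus stands out from the noise.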
One mechanism by which disease-associated DNA variation can alter disease risk is altering gene expression. However, linkage disequilibrium (LD) between variants, mostly single-nucleotide polymorphisms (SNPs), means it is not sufficient to show that a particular variant associates with both disease and expression, as there could be two distinct causal variants in LD. Here, we describe a formal statistical test of colocalization and apply it to type 1 diabetes (T1D)-associated regions identified mostly through genome-wide association studies and expression quantitative trait loci (eQTLs) discovered in a recently determined large monocyte expression data set from the Gutenberg Health Study (1370 individuals), with confirmation sought in an additional data set from the Cardiogenics Transcriptome Study (558 individuals). We excluded 39 out of 60 overlapping eQTLs in 49 T1D regions from possible colocalization and identified 21 coincident eQTLs, representing 21 genes in 14 distinct T1D regions. Our results reflect the importance of monocyte (and their derivatives, macrophage and dendritic cell) gene expression in human T1D and support the candidacy of several genes as causal factors in autoimmune pancreatic beta-cell destruction, including AFF3, CD226, CLECL1, DEXI, FKRP, PRKD2, RNLS, SMARCE1 and SUOX, in addition to the recently described GPR183 (EBI2) gene.
Eukaryotic chromosomes are not randomly distributed in the interphase nucleus, but instead occupy distinct territories. Nonetheless, the genome-wide relationships of gene regulation to gene nuclear location remain poorly understood in yeast. In the three-dimensional view of gene regulation, we found that a considerable number of transcription factors (TFs) regulate genes that are colocalized in the nucleus. Colocalized TF target genes are more strongly coregulated compared with the other TF target genes. Target genes of chromatin regulators are also colocalized. These results demonstrate that colocalization of coregulated genes is a common process, and three-dimensional gene positioning is an important part of gene regulation. Our findings will have implications in understanding nuclear architecture and function.
At present, 51 genes are known to be responsible for Non-Syndromic hereditary Hearing Loss (NSHL), but the knowledge of 121 NSHL-linked chromosomal regions suggests that a number of disease genes remain to be uncovered. To help scientists find new NSHL genes, we built a gene-scoring system, integrating the Gene Ontology, NCBI Gene and Map Viewer databases, which prioritizes candidate genes according to their probability of causing NSHL. We defined a set of candidates and measured their functional similarity with respect to the disease gene set, computing a score () that relies on the assumption that functionally related genes might contribute to the same (disease) phenotype. A Kolmogorov-Smirnov test, comparing the pair-wise distribution on the disease gene set with the distribution on the remaining human genes, provided a statistical assessment of this assumption. We found at a p-value that the former pair-wise is greater than the latter, justifying a prioritization strategy based on the functional similarity of candidate genes with respect to the disease gene set. A cross-validation test measured to what extent the ranking for NSHL differs from a random ordering: after adding 15% of the disease genes to the candidate gene set, the ranking of the disease genes in the first eight positions was statistically different from a hypergeometric distribution with a p-value and a power. The twenty top-scored genes were finally examined to evaluate their possible involvement in NSHL. We found that half of them are known to be expressed in human inner ear or cochlea and are mainly involved in remodeling and organization of actin formation and maintenance of the cilia and the endocochlear potential. These findings strongly indicate that our metric was able to suggest excellent NSHL candidates to be screened in patients and controls for causative mutations.
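The Kolmogorov-Smirnov comparison of two score distributions mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`ecdf`, `ks_statistic`, `ks_greater`) and the similarity values are hypothetical, and only the test statistic is computed, not its p-value.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return the empirical cumulative distribution function of a sample."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda x: bisect_right(ordered, x) / n


def ks_statistic(a, b):
    """Two-sided two-sample KS statistic: the largest absolute ECDF gap."""
    fa, fb = ecdf(a), ecdf(b)
    return max(abs(fa(x) - fb(x)) for x in set(a) | set(b))


def ks_greater(a, b):
    """One-sided statistic that is large when values in `a` tend to exceed
    those in `b`, i.e. when the ECDF of `a` lies below the ECDF of `b`."""
    fa, fb = ecdf(a), ecdf(b)
    return max(fb(x) - fa(x) for x in set(a) | set(b))


# Hypothetical pair-wise similarity scores (illustrative values only).
disease_pairs = [0.62, 0.70, 0.81, 0.55]
background_pairs = [0.20, 0.35, 0.41, 0.58, 0.15]

print("two-sided D:", ks_statistic(disease_pairs, background_pairs))
print("one-sided D:", ks_greater(disease_pairs, background_pairs))
```

A statistic close to 1 indicates nearly separated distributions, while a value close to 0 indicates that the two samples are hard to distinguish. Turning the statistic into a p-value requires the KS reference distribution or a permutation scheme, which this sketch leaves out.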
The connection between chromatin nuclear organization and gene activity is vividly illustrated by the observation that transcriptional coregulation of certain genes appears to be directly influenced by their spatial proximity. This fact poses the more general question of whether it is at all feasible that the numerous genes that are coregulated on a given chromosome, especially those at large genomic distances, might become proximate inside the nucleus. This problem is studied here using steered molecular dynamics simulations in order to enforce the colocalization of thousands of knowledge-based gene sequences on a model for the gene-rich human chromosome 19. Remarkably, it is found that most () gene pairs can be brought simultaneously into contact. This is made possible by the low degree of intra-chromosome entanglement and the large number of cliques in the gene coregulatory network. A clique is a set of genes coregulated all together as a group. The constrained conformations for the model chromosome 19 are further shown to be organized in spatial macrodomains that are similar to those inferred from recent HiC measurements. The findings indicate that gene coregulation and colocalization are largely compatible and that this relationship can be exploited to draft the overall spatial organization of the chromosome in vivo. The more general validity and implications of these findings could be investigated by applying the general and transferable computational strategy introduced here to other eukaryotic chromosomes.
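To make the notion of a clique in a coregulatory network concrete, the following sketch enumerates maximal cliques with the basic Bron-Kerbosch recursion. It is an illustration under assumptions, not the procedure used in the study: the gene labels and the edge list are hypothetical, and an edge is taken to mean that two genes are significantly coregulated.

```python
def build_adjacency(edges):
    """Build an undirected adjacency map from a list of gene pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def maximal_cliques(adj):
    """Enumerate maximal cliques (basic Bron-Kerbosch, no pivoting)."""
    cliques = []

    def expand(current, candidates, excluded):
        # A clique is maximal when no vertex can extend it.
        if not candidates and not excluded:
            cliques.append(frozenset(current))
            return
        for v in list(candidates):
            expand(current | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adj), set())
    return cliques


# Hypothetical coregulated gene pairs: A, B and C are mutually coregulated,
# while D is coregulated only with C.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
adj = build_adjacency(edges)

for clique in maximal_cliques(adj):
    print(sorted(clique))
```

In this toy network the three mutually coregulated genes form one clique, so bringing them into contact satisfies three pairwise constraints at once, which is the intuition behind the role of cliques described above.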
Recent high-throughput experiments have shown that chromosome regions (loci) which accommodate specific sets of coregulated genes can be in close spatial proximity despite their possibly large sequence separation. The findings pose the question of whether gene coregulation and gene colocalization are related in general. Here, we tackle this problem using a knowledge-based coarse-grained model of human chromosome 19. Specifically, we carry out steered molecular dynamics simulations to promote the colocalization of hundreds of gene pairs that are known to be significantly coregulated. We show that most () of such pairs can be simultaneously colocalized. This result is, in turn, shown to depend on at least two distinctive chromosomal features: the remarkably low degree of intra-chain entanglement found in chromosomes inside the nucleus and the large number of cliques present in the gene coregulatory network. The results are therefore largely consistent with the coregulation-colocalization hypothesis. Furthermore, the model chromosome conformations obtained by applying the coregulation constraints are found to display spatial macrodomains that have significant similarities with those inferred from HiC measurements of human chromosome 19. This finding suggests that suitable extensions of the present approach might be used to propose viable ensembles of eukaryotic chromosome conformations in vivo.