Transcription factors (TFs) form complexes that bind regulatory modules (RMs) within DNA, to control specific sets of genes. Some transcription factor binding sites (TFBSs) near the transcription start site (TSS) display tight positional preferences relative to the TSS. Furthermore, near the TSS, RMs can co-localize TFBSs with each other and the TSS. The proportion of TFBS positional preferences due to TFBS co-localization within RMs is unknown, however. ChIP experiments confirm co-localization of some TFBSs genome-wide, including near the TSS, but they typically examine only a few TFs at a time, using non-physiological conditions that can vary from lab to lab. In contrast, sequence analysis can examine many TFs uniformly and methodically, broadly surveying the co-localization of TFBSs with tight positional preferences relative to the TSS.
Our statistics found 43 significant sets of human motifs in the JASPAR TF Database with positional preferences relative to the TSS, with 38 preferences tight (±5 bp). Each set of motifs corresponded to a gene group of 135 to 3304 genes, with 42/43 (98%) gene groups independently validated by DAVID, a gene ontology database, with FDR < 0.05. Motifs corresponding to two TFBSs in a RM should co-occur more than by chance alone, enriching the intersection of the gene groups corresponding to the two TFs. Thus, a gene-group intersection systematically enriched beyond chance alone provides evidence that the two TFs participate in an RM. Of the 903 = 43*42/2 intersections of the 43 significant gene groups, we found 768/903 (85%) pairs of gene groups with significantly enriched intersections, with 564/768 (73%) intersections independently validated by DAVID with FDR < 0.05. A user-friendly web site at http://go.usa.gov/3kjsH permits biologists to explore the interaction network of our TFBSs to identify candidate subunit RMs.
Gene duplication and convergent evolution within a genome provide obvious biological mechanisms for replicating an RM near the TSS that binds a particular TF subunit. Of all intersections of our 43 significant gene groups, 85% were significantly enriched, with 73% of the significant enrichments independently validated by gene ontology. The co-localization of TFBSs within RMs therefore likely explains much of the tight TFBS positional preferences near the TSS.
Electronic supplementary material
The online version of this article (doi:10.1186/s12859-016-1354-5) contains supplementary material, which is available to authorized users.
Transcription factor binding site; Positional preference; Transcription start site
Chocó is a state located on the Pacific coast of Colombia that has a majority Afro-Colombian population. The objective of this study was to characterize the genetic ancestry, admixture and diversity of the population of Chocó, Colombia.
Genetic variation was characterized for a sample of 101 donors (61 female and 40 male) from the state of Chocó. Genotypes were determined for each individual via the characterization of 610,545 single nucleotide polymorphisms genome-wide. Haplotypes for the uniparental mitochondrial DNA (female) and Y-DNA (male) chromosomes were also determined. These data were used for comparative analyses with a number of worldwide populations, including putative ancestral populations from Africa, the Americas and Europe, along with several admixed American populations.
The population of Chocó has predominantly African genetic ancestry (75.8%) with approximately equal parts European (13.4%) and Native American (11.1%) ancestry. Chocó shows relatively high levels of three-way genetic admixture, and far higher levels of Native American ancestry, compared to other New World African populations from the Caribbean and the United States. There is a striking pattern of sex-specific ancestry in Chocó, with Native American admixture along the female lineage and European admixture along the male lineage. The population of Chocó is also characterized by relatively high levels of overall genetic diversity compared to both putative ancestral populations and other admixed American populations.
These results suggest a unique genetic heritage for the population of Chocó and underscore the profound human genetic diversity that can be found in the region.
Admixture; Afro-Colombian; Colombia; Genetic ancestry; Genetic diversity; Human genome
Pseudomonas fluorescens is a well-known plant growth-promoting rhizobacterium (PGPR). We report here the first whole-genome sequence of PGPR P. fluorescens evaluated in Colombian banana plants. The genome sequences contains genes involved in plant growth and defense, including bacteriocins, 1-aminocyclopropane-1-carboxylic acid (ACC) deaminase, and genes that provide resistance to toxic compounds.
Compaction of DNA into chromatin is a characteristic feature of eukaryotic organisms. The core (H2A, H2B, H3, H4) and linker (H1) histone proteins are responsible for this compaction through the formation of nucleosomes and higher order chromatin aggregates. Moreover, histones are intricately involved in chromatin functioning and provide a means for genome dynamic regulation through specific histone variants and histone post-translational modifications. ‘HistoneDB 2.0 – with variants’ is a comprehensive database of histone protein sequences, classified by histone types and variants. All entries in the database are supplemented by rich sequence and structural annotations with many interactive tools to explore and compare sequences of different variants from various organisms. The core of the database is a manually curated set of histone sequences grouped into 30 different variant subsets with variant-specific annotations. The curated set is supplemented by an automatically extracted set of histone sequences from the non-redundant protein database using algorithms trained on the curated set. The interactive web site supports various searching strategies in both datasets: browsing of phylogenetic trees; on-demand generation of multiple sequence alignments with feature annotations; classification of histone-like sequences and browsing of the taxonomic diversity for every histone variant. HistoneDB 2.0 is a resource for the interactive comparative analysis of histone protein sequences and their implications for chromatin function.
Campylobacter coli is considered one of the main causes of food-borne illness worldwide. We report here the whole-genome sequence of multidrug-resistant Campylobacter coli strain COL B1-266, isolated from the Colombian poultry chain. The genome sequences encode genes for a variety of antimicrobial resistance genes, including aminoglycosides, β-lactams, lincosamides, fluoroquinolones, and tetracyclines.
Campylobacter coli, along with Campylobacter jejuni, is a major agent of gastroenteritis and acute enterocolitis in humans. We report the whole-genome sequences of two multidrug-resistance C. coli strains, isolated from the Colombian poultry chain. The isolates contain a variety of antimicrobial resistance genes for aminoglycosides, lincosamides, fluoroquinolones, and tetracycline.
The genus Physalis is common in the Americas and includes several economically important species, among them Physalis peruviana that produces appetizing edible fruits. We studied the genetic diversity and population structure of P. peruviana and characterized 47 accessions of this species along with 13 accessions of related taxa consisting of 222 individuals from the Colombian Corporation of Agricultural Research (CORPOICA) germplasm collection, using Conserved Orthologous Sequences (COSII) and Immunity Related Genes (IRGs). In addition, 642 Single Nucleotide Polymorphism (SNPs) markers were identified and used for the genetic diversity analysis. A total of 121 alleles were detected in 24 InDels loci ranging from 2 to 9 alleles per locus, with an average of 5.04 alleles per locus. The average number of alleles in the SNP markers was two. The observed heterozygosity for P. peruviana with InDel and SNP markers was higher (0.48 and 0.59) than the expected heterozygosity (0.30 and 0.41). Interestingly, the observed heterozygosity in related taxa (0.4 and 0.12) was lower than the expected heterozygosity (0.59 and 0.25). The coefficient of population differentiation FST was 0.143 (InDels) and 0.038 (SNPs), showing a relatively low level of genetic differentiation among P. peruviana and related taxa. Higher levels of genetic variation were instead observed within populations based on the AMOVA analysis. Population structure analysis supported the presence of two main groups and PCA analysis based on SNP markers revealed two distinct clusters in the P. peruviana accessions corresponding to their state of cultivation. In this study, we identified molecular markers useful to detect genetic variation in Physalis germplasm for assisting conservation and crossbreeding strategies.
Cape gooseberry; germplasm; COSII; IRGs; SNPs; Genetic variation
Bacillus amyloliquefaciens is an important plant growth-promoting rhizobacterium (PGPR). We report the first whole-genome sequence of PGPR Bacillus amyloliquefaciens evaluated in Colombian banana plants. The genome sequences encode genes involved in plant growth and defense, including bacteriocins, ribosomally synthesized antibacterial peptides, in addition to genes that provide resistance to toxic compounds.
Salmonella enterica is a pathogen of significant public health importance that is frequently associated with foodborne illness. We report the whole-genome sequences of four multidrug-resistant Salmonella enterica serovar Paratyphi B and Heidelberg strains, isolated from the Colombian poultry chain. The isolates contain a variety of antimicrobial resistance genes for aminoglycosides, β-lactams, fluoroquinolones, sulfonamides, tetracycline, and trimethoprim.
Mycobacterium colombiense is a novel member of the Mycobacterium avium complex, which produces respiratory and disseminated infections in immunosuppressed patients. Currently, the morphological and genetic bases underlying the phenotypic features of M. colombiense strains remain unknown. In the present study, we demonstrated that M. colombiense strains displaying smooth morphology show increased biofilm formation on hydrophobic surfaces and sliding on motility plates. Thin-layer chromatography experiments showed that M. colombiense strains displaying smooth colonies produce large amounts of glycolipids with a chromatographic behaviour similar to that of the glycopeptidolipids (GPLs) of M. avium. Conversely, we observed a natural rough variant of M. colombiense (57B strain) lacking pigmentation and exhibiting impaired sliding, biofilm formation, and GPL production. Bioinformatics analyses revealed a gene cluster that is likely involved in GPL biosynthesis in M. colombiense CECT 3035. RT-qPCR experiments showed that motile culture conditions activate the transcription of genes possibly involved in key enzymatic activities of GPL biosynthesis.
Vibrio navarrensis is an aquatic bacterium recently shown to be associated with human illness. We report the first genome sequences of three V. navarrensis strains obtained from clinical and environmental sources. Preliminary analyses of the sequences reveal that V. navarrensis contains genes commonly associated with virulence in other human pathogens.
Some biological sequences contain subsequences of unusual composition, e.g., some proteins contain DNA binding domains, transmembrane regions, and charged regions; and some DNA sequences contain repeats. Requiring time linear in the length of an input sequence, the Ruzzo-Tompa (RT) Algorithm finds subsequences of unusual composition, using a sequence of scores as input and the corresponding “maximal segments” as output. (Loosely, maximal segments are the contiguous subsequences having greatest total score.) Just as gaps improved the sensitivity of BLAST, in principle gaps could help tune other tools, to improve sensitivity when searching for subsequences of unusual composition.
Call a graph whose vertices are totally ordered a “totally ordered graph”. In a totally ordered graph, call a path whose vertices are in increasing order an “increasing path”. The input of the RT Algorithm can be generalized to a finite, totally ordered, weighted graph, so the algorithm then locates maximal segments, corresponding to increasing paths of maximal weight. The generalization permits penalized deletion of unfavorable letters from contiguous subsequences, so the generalized Ruzzo-Tompa algorithm can find subsequences with greatest total gapped scores. The search for inexact simple repeats in DNA exemplifies some of the concepts. For some limited types of repeats, RepWords, a repeat-finding tool based on the principled use of the Ruzzo-Tompa algorithm, performed better than a similar extant tool.
With minimal programming effort, the generalization of the Ruzzo-Tompa algorithm given in this article could improve the performance of many programs for finding biological subsequences of unusual composition.
Mammalian-wide interspersed repeats (MIRs) are the most ancient family of transposable elements (TEs) in the human genome. The deep conservation of MIRs initially suggested the possibility that they had been exapted to play functional roles for their host genomes. MIRs also happen to be the only TEs whose presence in-and-around human genes is positively correlated to tissue-specific gene expression. Similar associations of enhancer prevalence within genes and tissue-specific expression, along with MIRs’ previous implication as providing regulatory sequences, suggested a possible link between MIRs and enhancers.
To test the possibility that MIRs contribute functional enhancers to the human genome, we evaluated the relationship between MIRs and human tissue-specific enhancers in terms of genomic location, chromatin environment, regulatory function, and mechanistic attributes. This analysis revealed MIRs to be highly concentrated in enhancers of the K562 and HeLa human cell-types. Significantly more enhancers were found to be linked to MIRs than would be expected by chance, and putative MIR-derived enhancers are characterized by a chromatin environment highly similar to that of canonical enhancers. MIR-derived enhancers show strong associations with gene expression levels, tissue-specific gene expression and tissue-specific cellular functions, including a number of biological processes related to erythropoiesis. MIR-derived enhancers were found to be a rich source of transcription factor binding sites, underscoring one possible mechanistic route for the element sequences co-option as enhancers. There is also tentative evidence to suggest that MIR-enhancer function is related to the transcriptional activity of non-coding RNAs.
Taken together, these data reveal enhancers to be an important cis-regulatory platform from which MIRs can exercise a regulatory function in the human genome and help to resolve a long-standing conundrum as to the reason for MIRs’ deep evolutionary conservation.
The genus Mycobacterium comprises more than 150 species, including important pathogens for humans which cause major public health problems. The vast majority of efforts to understand the genus have been addressed in studies with Mycobacterium tuberculosis. The biological differentiation between M. tuberculosis and nontuberculous mycobacteria (NTM) is important because there are distinctions in the sources of infection, treatments, and the course of disease. Likewise, the importance of studying NTM is not only due to its clinical significance but also due to the mechanisms by which some species are pathogenic while others are not. Mycobacterium avium complex (MAC) is the most important group of NTM opportunistic pathogens, since it is the second largest medical complex in the genus after the M. tuberculosis complex. Here, we evaluated the virulence and immune response of M. avium subsp. avium and Mycobacterium colombiense, using experimental models of progressive pulmonary tuberculosis and subcutaneous infection in BALB/c mice. Mice infected intratracheally with a high dose of MAC strains showed high expression of tumor necrosis factor alpha (TNF-α) and inducible nitric oxide synthase with rapid bacillus elimination and numerous granulomas, but without lung consolidation during late infection in coexistence with high expression of anti-inflammatory cytokines. In contrast, subcutaneous infection showed high production of the proinflammatory cytokines TNF-α and gamma interferon with relatively low production of anti-inflammatory cytokines such as interleukin-10 (IL-10) or IL-4, which efficiently eliminate the bacilli but maintain extensive inflammation and fibrosis. Thus, MAC infection evokes different immune and inflammatory responses depending on the MAC species and affected tissue.
Chromatin structure plays a key role in regulating gene expression and embryonic differentiation; however, the factors that determine the organization of chromatin around regulatory sites are not fully known. Here we show that HMGN1, a nucleosome-binding protein ubiquitously expressed in vertebrate cells, preferentially binds to CpG island-containing promoters and affects the organization of nucleosomes, DNase I hypersensitivity, and the transcriptional profile of mouse embryonic stem cells and neural progenitors. Loss of HMGN1 alters the organization of an unstable nucleosome at transcription start sites, reduces the number of DNase I-hypersensitive sites genome wide, and decreases the number of nestin-positive neural progenitors in the subventricular zone (SVZ) region of mouse brain. Thus, architectural chromatin-binding proteins affect the transcription profile and chromatin structure during embryonic stem cell differentiation.
The Cape gooseberry (Physalisperuviana L) is an Andean exotic fruit with high nutritional value and appealing medicinal properties. However, its cultivation faces important phytosanitary problems mainly due to pathogens like Fusarium oxysporum, Cercosporaphysalidis and Alternaria spp. Here we used the Cape gooseberry foliar transcriptome to search for proteins that encode conserved domains related to plant immunity including: NBS (Nucleotide Binding Site), CC (Coiled-Coil), TIR (Toll/Interleukin-1 Receptor). We identified 74 immunity related gene candidates in P. peruviana which have the typical resistance gene (R-gene) architecture, 17 Receptor like kinase (RLKs) candidates related to PAMP-Triggered Immunity (PTI), eight (TIR-NBS-LRR, or TNL) and nine (CC–NBS-LRR, or CNL) candidates related to Effector-Triggered Immunity (ETI) genes among others. These candidate genes were categorized by molecular function (98%), biological process (85%) and cellular component (79%) using gene ontology. Some of the most interesting predicted roles were those associated with binding and transferase activity. We designed 94 primers pairs from the 74 immunity-related genes (IRGs) to amplify the corresponding genomic regions on six genotypes that included resistant and susceptible materials. From these, we selected 17 single band amplicons and sequenced them in 14 F. oxysporum resistant and susceptible genotypes. Sequence polymorphisms were analyzed through preliminary candidate gene association, which allowed the detection of one SNP at the PpIRG-63 marker revealing a nonsynonymous mutation in the predicted LRR domain suggesting functional roles for resistance.
We report the first genome sequences for six strains of Rhodanobacter species isolated from a variety of soil and subsurface environments. Three of these strains are capable of complete denitrification and three others are not. However, all six strains contain most of the genes required for the respiration of nitrate to gaseous nitrogen. The nondenitrifying members of the genus lack only the gene for nitrate reduction, the first step in the full denitrification pathway. The data suggest that the environmental role of bacteria from the genus Rhodanobacter should be reevaluated.
P-type ATPases hydrolyze ATP and release energy that is used in the transport of ions against electrochemical gradients across plasma membranes, making these proteins essential for cell viability. Currently, the distribution and function of these ion transporters in mycobacteria are poorly understood.
In this study, probabilistic profiles were constructed based on hidden Markov models to identify and classify P-type ATPases in the Mycobacterium tuberculosis complex (MTBC) according to the type of ion transported across the plasma membrane. Topology, hydrophobicity profiles and conserved motifs were analyzed to correlate amino acid sequences of P-type ATPases and ion transport specificity. Twelve candidate P-type ATPases annotated in the M. tuberculosis H37Rv proteome were identified in all members of the MTBC, and probabilistic profiles classified them into one of the following three groups: heavy metal cation transporters, alkaline and alkaline earth metal cation transporters, and the beta subunit of a prokaryotic potassium pump. Interestingly, counterparts of the non-catalytic beta subunits of Hydrogen/Potassium and Sodium/Potassium P-type ATPases were not found.
The high content of heavy metal transporters found in the MTBC suggests that they could play an important role in the ability of M. tuberculosis to survive inside macrophages, where tubercle bacilli face high levels of toxic metals. Finally, the results obtained in this work provide a starting point for experimental studies that may elucidate the ion specificity of the MTBC P-type ATPases and their role in mycobacterial infections.
Tuberculosis; Mycobacterium tuberculosis complex; P-type ATPases; Ion transport; Conserved motifs
Understanding gene regulation is a major objective in molecular biology research. Frequently, transcription is driven by transcription factors (TFs) that bind to specific DNA sequences. These motifs are usually short and degenerate, rendering the likelihood of multiple copies occurring throughout the genome due to random chance as high. Despite this, TFs only bind to a small subset of sites, thus prompting our investigation into the differences between motifs that are bound by TFs and those that remain unbound. Here we constructed vectors representing various chromatin- and sequence-based features for a published set of bound and unbound motifs representing nine TFs in the budding yeast Saccharomyces cerevisiae. Using a machine learning approach, we identified a set of features that can be used to discriminate between bound and unbound motifs. We also discovered that some TFs bind most or all of their strong motifs in intergenic regions. Our data demonstrate that local sequence context can be strikingly different around motifs that are bound compared to motifs that are unbound. We concluded that there are multiple combinations of genomic features that characterize bound or unbound motifs.
Gene regulation; yeast; transcription factors; genomic features; machine learning
This chapter describes a workflow for measuring the efficacy of a barcode in identifying species. First, assemble individual sequence databases corresponding to each barcode marker. A controlled collection of taxonomic data is preferable to GenBank data, because GenBank data can be problematic, particularly when comparing barcodes based on more than one marker. To ensure proper controls when evaluating species identification, specimens not having a sequence in every marker database should be discarded. Second, select a computer algorithm for assigning species to barcode sequences. No algorithm has yet improved notably on assigning a specimen to the species of its nearest neighbor within a barcode database. Because global sequence alignments (e.g., with the Needleman–Wunsch algorithm, or some related algorithm) examine entire barcode sequences, they generally produce better species assignments than local sequence alignments (e.g., with BLAST). No neighboring method (e.g., global sequence similarity, global sequence distance, or evolutionary distance based on a global alignment) has yet shown a notable superiority in identifying species. Finally, “the probability of correct identification” (PCI) provides an appropriate measurement of barcode efficacy. The overall PCI for a data set is the average of the species PCIs, taken over all species in the data set. This chapter states explicitly how to calculate PCI, how to estimate its statistical sampling error, and how to use data on PCR failure to set limits on how much improvements in PCR technology can improve species identification.
Barcode efficacy in species identification; Probability of correct identification; DNA barcode
Physalis peruviana commonly known as Cape gooseberry is a member of the Solanaceae family that has an increasing popularity due to its nutritional and medicinal values. A broad range of genomic tools is available for other Solanaceae, including tomato and potato. However, limited genomic resources are currently available for Cape gooseberry.
We report the generation of a total of 652,614 P. peruviana Expressed Sequence Tags (ESTs), using 454 GS FLX Titanium technology. ESTs, with an average length of 371 bp, were obtained from a normalized leaf cDNA library prepared using a Colombian commercial variety. De novo assembling was performed to generate a collection of 24,014 isotigs and 110,921 singletons, with an average length of 1,638 bp and 354 bp, respectively. Functional annotation was performed using NCBI’s BLAST tools and Blast2GO, which identified putative functions for 21,191 assembled sequences, including gene families involved in all the major biological processes and molecular functions as well as defense response and amino acid metabolism pathways. Gene model predictions in P. peruviana were obtained by using the genomes of Solanum lycopersicum (tomato) and Solanum tuberosum (potato). We predict 9,436 P. peruviana sequences with multiple-exon models and conserved intron positions with respect to the potato and tomato genomes. Additionally, to study species diversity we developed 5,971 SSR markers from assembled ESTs.
We present the first comprehensive analysis of the Physalis peruviana leaf transcriptome, which will provide valuable resources for development of genetic tools in the species. Assembled transcripts with gene models could serve as potential candidates for marker discovery with a variety of applications including: functional diversity, conservation and improvement to increase productivity and fruit quality. P. peruviana was estimated to be phylogenetically branched out before the divergence of five other Solanaceae family members, S. lycopersicum, S. tuberosum, Capsicum spp, S. melongena and Petunia spp.
P. peruviana; Solanaceae; ESTs; Functional annotation; Gene model; Phylogenetics
We report the first whole-genome sequence of the Mycobacterium colombiense type strain, CECT 3035, which was initially isolated from Colombian HIV-positive patients and causes respiratory and disseminated infections. Preliminary comparative analyses indicate that the M. colombiense lineage has experienced a substantial genome expansion, possibly contributing to its distinct pathogenic capacity.
The urea cycle converts toxic ammonia to urea within the liver of mammals. At least 6 enzymes are required for ureagenesis, which correlates with dietary protein intake. The transcription of urea cycle genes is, at least in part, regulated by glucocorticoid and glucagon hormone signaling pathways. N-acetylglutamate synthase (NAGS) produces a unique cofactor, N-acetylglutamate (NAG), that is essential for the catalytic function of the first and rate-limiting enzyme of ureagenesis, carbamyl phosphate synthetase 1 (CPS1). However, despite the important role of NAGS in ammonia removal, little is known about the mechanisms of its regulation. We identified two regions of high conservation upstream of the translation start of the NAGS gene. Reporter assays confirmed that these regions represent promoter and enhancer and that the enhancer is tissue specific. Within the promoter, we identified multiple transcription start sites that differed between liver and small intestine. Several transcription factor binding motifs were conserved within the promoter and enhancer regions while a TATA-box motif was absent. DNA-protein pull-down assays and chromatin immunoprecipitation confirmed binding of Sp1 and CREB, but not C/EBP in the promoter and HNF-1 and NF-Y, but not SMAD3 or AP-2 in the enhancer. The functional importance of these motifs was demonstrated by decreased transcription of reporter constructs following mutagenesis of each motif. The presented data strongly suggest that Sp1, CREB, HNF-1, and NF-Y, that are known to be responsive to hormones and diet, regulate NAGS transcription. This provides molecular mechanism of regulation of ureagenesis in response to hormonal and dietary changes.
Merozoite surface protein 1 (MSP-1) is a polymorphic malaria protein with functional domains involved in parasite erythrocyte interaction. Plasmodium vivax MSP-1 has a fragment (Pv200L) that has been identified as a potential subunit vaccine because it is highly immunogenic and induces partial protection against infectious parasite challenge in vaccinated monkeys. To determine the extent of genetic polymorphism and its effect on the translated protein, we sequenced the Pv200L coding region from isolates of 26 P. vivax-infected patients in a malaria-endemic area of Colombia. The extent of nucleotide diversity (π) in these isolates (0.061 ± 0.004) was significantly lower (P ≤ 0.001) than that observed in Thai and Brazilian isolates; 0.083 ± 0.006 and 0.090 ± 0.006, respectively. We found two new alleles and several previously unidentified dimorphic substitutions and significant size polymorphism. The presence of highly conserved blocks in this fragment has important implications for the development of Pv200L as a subunit vaccine candidate.
Meiotic recombination is not distributed uniformly throughout the genome. There are regions of high and low recombination rates called hot and cold spots, respectively. The recombination rate parallels the frequency of DNA double-strand breaks (DSBs) that initiate meiotic recombination. The aim is to identify biological features associated with DSB frequency. We constructed vectors representing various chromatin and sequence-based features for 1179 DSB hot spots and 1028 DSB cold spots. Using a feature selection approach, we have identified five features that distinguish hot from cold spots in Saccharomyces cerevisiae with high accuracy, namely the histone marks H3K4me3, H3K14ac, H3K36me3, and H3K79me3; and GC content. Previous studies have associated H3K4me3, H3K36me3, and GC content with areas of mitotic recombination. H3K14ac and H3K79me3 are novel predictions and thus represent good candidates for further experimental study. We also show nucleosome occupancy maps produced using next generation sequencing exhibit a bias at DSB hot spots and this bias is strong enough to obscure biologically relevant information. A computational approach using feature selection can productively be used to identify promising biological associations. H3K14ac and H3K79me3 are novel predictions of chromatin marks associated with meiotic DSBs. Next generation sequencing can exhibit a bias that is strong enough to lead to incorrect conclusions. Care must be taken when interpreting high throughput sequencing data where systematic biases have been documented.