The single nucleotide polymorphism rs2071746 and a (GT)n microsatellite within the human gene encoding heme oxygenase-1 (HMOX1) are associated with incidence or outcome in a variety of diseases. Most of these associations involve either release of heme or oxidative stress. Both polymorphisms are localized in the promoter region, but previously reported correlations with heme oxygenase-1 expression remain inconsistent. This ambiguity suggests a more complex organization of the 5’ gene region, which we sought to investigate more fully.
We evaluated the 5’ end of HMOX1 and found a novel first exon 1a, placing the two previously reported polymorphisms in intronic or exonic positions within the 5’ untranslated region, respectively. Expression of exon 1a can be induced in HepG2 hepatoma cells by hemin, and exon 1a acts as a repressor of heme oxygenase-1 translation, as shown by luciferase reporter assays. Moreover, minigene approaches revealed that the quantitative outcome of alternative splicing within the 5’ untranslated region is affected by the (GT)n microsatellite.
These data support an extended HMOX1 gene model and provide further insights into the regulation of heme oxygenase-1 expression. Alternative splicing within the HMOX1 5' untranslated region contributes to translational regulation and is a mechanistic feature involved in the interplay between genetic variations, heme oxygenase-1 expression and disease outcome.
Oligodendroglial tumors form a distinct subgroup of gliomas, characterized by a better response to treatment and prolonged overall survival. Most oligodendrogliomas and also some oligoastrocytomas are characterized by a unique and typical unbalanced translocation, der(1,19), resulting in a 1p/19q co-deletion. Candidate tumor suppressor genes targeted by these losses, CIC on 19q13.2 and FUBP1 on 1p31.1, were only recently discovered. We analyzed 17 oligodendrogliomas and oligoastrocytomas by applying a comprehensive approach consisting of RNA expression analysis, DNA sequencing of CIC, FUBP1, IDH1/2, and array CGH. We confirmed three different genetic subtypes in our samples: i) the “oligodendroglial” subtype with 1p/19q co-deletion in twelve out of 17 tumors; ii) the “astrocytic” subtype in three tumors; iii) the “other” subtype in two tumors. All twelve tumors with the 1p/19q co-deletion carried the most common IDH1 R132H mutation. In seven of these tumors, we found protein-disrupting point mutations in the remaining allele of CIC, four of which are novel. One of these tumors also had a deleterious mutation in FUBP1. Only by integrating RNA expression and array CGH data were we able to discover an exon-spanning homozygous microdeletion within the remaining allele of CIC in an additional tumor with 1p/19q co-deletion. We therefore propose that the mutation rate might be underestimated when looking at sequence variants alone. In conclusion, the high frequency and the spectrum of CIC mutations in our 1p/19q-codeleted tumor cohort support the hypothesis that CIC acts as a tumor suppressor in these tumors, whereas FUBP1 might play only a minor role.
There is growing evidence for the prevalence of copy number variation (CNV) and its role in phenotypic variation in many eukaryotic species. Here we use array comparative genomic hybridization to explore the extent of this type of structural variation in domesticated barley cultivars and wild barleys.
A collection of 14 barley genotypes, including eight cultivars and six wild barleys, was used for comparative genomic hybridization. CNV affects 14.9% of all the sequences that were assessed. Higher levels of CNV diversity are present in the wild accessions relative to cultivated barley. CNVs are enriched near the ends of all chromosomes except 4H, which exhibits the lowest frequency of CNVs. CNV affects 9.5% of the coding sequences represented on the array, and the genes affected by CNV are enriched for sequences annotated as disease-resistance proteins and protein kinases. Sequence-based comparisons of CNV between cultivars Barke and Morex provided evidence that DNA repair mechanisms of double-strand breaks via single-stranded annealing and synthesis-dependent strand annealing play an important role in the origin of CNV in barley.
We present the first catalog of CNVs in a diploid Triticeae species, which opens the door for future genome diversity research in a tribe that comprises the economically important cereal species wheat, barley, and rye. Our findings constitute a valuable resource for the identification of CNV affecting genes of agronomic importance. We also identify potential mechanisms that can generate variation in copy number in plant genomes.
Barley; Copy number variation; Comparative genomic hybridization; Disease-resistance genes; Double-strand break repair mechanisms
The structure of the human gut microbial community is determined by host genetics and environmental factors, and alterations in its structure have been associated with the onset of different diseases. Establishing a defined human gut microbial community within inbred rodent models provides a means to study microbial-related pathologies; however, an in-depth comparison of the established human gut microbiota in the different models is lacking. We compared the efficiency of establishing the bacterial component of a defined human microbial community within germ-free (GF) rats, GF mice and antibiotic-treated specific pathogen-free mice. Remarkable differences were observed between the different rodent models. While the majority of abundant human-donor bacterial phylotypes were established in the GF rats, only a subset was present in the GF mice. Although members of the phylum Bacteroidetes were well established in all rodent models, mice were enriched for phylotypes related to species of Bacteroides. In contrast to the efficiency with which Clostridiales populate the GF rat in relative proportions similar to those of the human donor, members of Clostridia cluster IV colonize the mouse gut only poorly. Thus, the genetic background of the different recipient rodent systems (that is, rats and mice) strongly influences the nature of the populating human gut microbiota, determining each model’s biological suitability.
Keywords: human intestinal microbiota; bacterial community; germ-free; gnotobiotic; rats and mice; T-RFLP; 454-pyrosequencing; F/B ratio; multivariate statistical analysis
The African annual fish Nothobranchius furzeri has over recent years been established as a model species for ageing-related studies. This is mainly based on its exceptionally short lifespan and the presence of typical characteristics of vertebrate ageing. To substantiate its role as an alternative vertebrate ageing model, a transcript catalogue is needed, which can serve e.g. as basis for identifying ageing-related genes.
To build the N. furzeri transcript catalogue, thirteen cDNA libraries were sequenced using Sanger, 454/Roche and Solexa/Illumina technologies, yielding about 39 Gb. In total, 19,875 protein-coding genes were identified and annotated. Of these, 71% are represented by at least one transcript contig with a complete coding sequence. Further, transcript levels of young and old fish of the strains GRZ and MZM-0403, which differ twofold in lifespan, were studied by RNA-seq. In skin and brain, 85 differentially expressed genes were detected; these have a role in cell cycle control and proliferation, inflammation and tissue maintenance. An RNA-seq experiment for zebrafish skin confirmed the ageing-related relevance of the findings in N. furzeri. Notably, transcript levels differed markedly between zebrafish and N. furzeri and also between N. furzeri strains, suggesting that ageing is accelerated in the short-lived N. furzeri strain GRZ compared to the longer-lived strain MZM-0403.
We provide a comprehensive, annotated N. furzeri transcript catalogue and a first transcriptome-wide insight into N. furzeri ageing. These data will serve as a basis for future functional studies of ageing-related genes.
Nothobranchius furzeri; Model fish species; Ageing; Transcriptome assembly; Transcript catalogue; Gene expression; RNA-seq
Many sequence data repositories can give a quick and easily accessible overview on genomes and their annotations. Less widespread is the possibility to compare related genomes with each other in a common database environment. We have previously described the GenColors database system (http://gencolors.fli-leibniz.de) and its applications to a number of bacterial genomes such as Borrelia, Legionella, Leptospira and Treponema. This system has an emphasis on genome comparison. It combines data from related genomes and provides the user with an extensive set of visualization and analysis tools. Eukaryote genomes are normally larger than prokaryote genomes and thus pose additional challenges for such a system. We have, therefore, adapted GenColors to also handle larger datasets of small eukaryotic genomes and to display eukaryotic gene structures. Further recent developments include whole genome views, genome list options and, for bacterial genome browsers, the display of horizontal gene transfer predictions. Two new GenColors-based databases for two fungal species (http://fgb.fli-leibniz.de) and for four social amoebas (http://sacgb.fli-leibniz.de) were set up. Both new resources open up a single entry point for related genomes for the amoebozoa and fungal research communities and other interested users. Comparative genomics approaches are greatly facilitated by these resources.
Human β-defensins are a family of antimicrobial peptides located at the mucosal surface. Both sequence multi-site variations (MSV) and copy-number variants (CNV) of the defensin-encoding genes are associated with increased risk for various diseases, including cancer and inflammatory conditions such as psoriasis and acute pancreatitis. In a case–control study, we investigated the association between MSV in DEFB104 as well as defensin gene (DEF) cluster copy number (CN), and pancreatic ductal adenocarcinoma (PDAC) and chronic pancreatitis (CP).
Two groups of PDAC (N=70) and CP (N=60) patients were compared to matched healthy control groups CARLA1 (N=232) and CARLA2 (N=160), respectively. Four DEFB104 MSV were haplotyped by PCR, cloning and sequencing. DEF cluster CN was determined by multiplex ligation-dependent probe amplification.
Neither the PDAC nor the CP cohort shows significant differences in the DEFB104 haplotype distribution compared to the respective control groups, CARLA1 and CARLA2.
The diploid DEF cluster CN exhibits a significantly different distribution between PDAC and CARLA1 (Fisher’s exact test P=0.027), but not between CP and CARLA2 (P=0.867).
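As a rough illustration of the kind of test reported above, the following sketch computes a two-sided Fisher's exact test for a 2×2 table using only the Python standard library. It assumes the copy-number distribution has been collapsed into two classes (for example, lower versus higher CN), which is a simplification not stated in the abstract. The function name `fisher_exact_2x2` is hypothetical, and any counts passed to it here are placeholders, not study data.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows could be cases/controls and columns lower/higher copy number
    (an assumed collapsing of the CN distribution, for illustration only).
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell,
        # with all row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)
```

For the classic tea-tasting table, `fisher_exact_2x2(3, 1, 1, 3)` evaluates to 34/70, about 0.486. A distribution with more than two CN classes would need the r×c generalisation of the test, which this sketch does not cover.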
The different DEF cluster b CN distribution in PDAC patients and healthy controls indicates a potential protective effect of higher CNs against the disease.
Defensins; Single nucleotide variants; Copy number variation; Chronic pancreatitis; Pancreatic ductal adenocarcinoma
Genotyping of 21 varicella-zoster virus (VZV) strains using a scattered single nucleotide polymorphism (SNP) method revealed ambiguous SNPs and two nontypeable isolates. For a further genetic characterization, the genomes of all strains were sequenced using the 454 technology. Almost-complete genome sequences were assembled, and most remaining gaps were closed with Sanger sequencing. Phylogenetic analysis of 42 genomes revealed five established and two novel VZV genotypes, provisionally termed VIII and IX. Genotypes VIII and IX are distinct from the previously reported provisional genotypes VI and VII as judged from the SNP pattern. The alignments showed evidence of ancient recombination events in the phylogeny of clade 4 and recent recombinations within single strains: 3/2005 (clade 1), 11 and 405/2007 (clade 3), 8 and DR (clade 4), CA123 and 413/2000 (clade 5), and strains of the novel genotypes VIII and IX. Bayesian tree inference of the thymidine kinase and the polymerase genes of the VZV clades and other varicelloviruses revealed that VZV radiation began some 110,000 years ago, which correlates with the out-of-Africa dispersal of modern humans. The split of ancestral clades 2/4 and 1/3/5/VIII/IX shows the greatest node height.
The purpose of the study is to elucidate the sequence composition of the short arm of rye chromosome 1 (Secale cereale) with special focus on its gene content, because this portion of the rye genome is an integrated part of several hundred bread wheat varieties worldwide.
Multiple Displacement Amplification of 1RS DNA, obtained from flow-sorted 1RS chromosomes of a 1RS ditelosomic wheat-rye addition line, and subsequent Roche 454FLX sequencing of this DNA yielded 195,313,589 bp of sequence information. This quantity of sequence information resulted in 0.43× sequence coverage of the 1RS chromosome arm, permitting the identification of genes with an estimated probability of 95%. A detailed analysis revealed that more than 5% of the 1RS sequence consisted of gene space, identifying at least 3,121 gene loci representing 1,882 different gene functions. Repetitive elements comprised about 72% of the 1RS sequence, Gypsy/Sabrina (13.3%) being the most abundant. More than four thousand simple sequence repeat (SSR) sites, mostly located in gene-related sequence reads, were identified for possible marker development. The existence of chloroplast insertions in 1RS has been verified by identifying chimeric chloroplast-genomic sequence reads. Synteny analysis of 1RS to the full genomes of Oryza sativa and Brachypodium distachyon revealed that about half of the genes of 1RS correspond to the distal end of the short arm of rice chromosome 5 and the proximal region of the long arm of Brachypodium distachyon chromosome 2. Comparison of the gene content of 1RS to the 1HS barley chromosome arm revealed high conservation of genes related to chromosome 5 of rice.
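The relationship between sequence yield and coverage quoted above can be checked with simple arithmetic, sketched below. Dividing the yield by the coverage gives an implied arm size of roughly 454 Mb; this is only what the two reported figures imply, not a size stated in the abstract. The second function is an assumption: a Poisson model of uniformly placed reads, which is one common way to estimate the chance that a locus is touched by at least one read. It is not necessarily how the 95% figure was derived, and the read and locus lengths passed to it are hypothetical inputs.

```python
from math import exp

SEQUENCED_BP = 195_313_589  # sequence yield reported in the text
COVERAGE = 0.43             # sequence coverage reported in the text


def implied_target_size(sequenced_bp, coverage):
    """Target size (bp) implied by yield and coverage: size = yield / coverage."""
    return sequenced_bp / coverage


def prob_locus_hit(coverage, locus_len, read_len):
    """Probability that at least one read overlaps a locus.

    Assumes read starts are uniformly distributed and Poisson-distributed in
    number (an illustrative model, not taken from the source). A read overlaps
    the locus if it starts within a window of locus_len + read_len - 1 bases.
    """
    expected_overlapping_reads = coverage * (locus_len + read_len - 1) / read_len
    return 1.0 - exp(-expected_overlapping_reads)


if __name__ == "__main__":
    size = implied_target_size(SEQUENCED_BP, COVERAGE)
    print(f"Implied 1RS arm size: {size / 1e6:.0f} Mb")
    # Hypothetical locus and read lengths, chosen only to show the trend.
    for locus_len in (1000, 2000, 4000):
        p = prob_locus_hit(COVERAGE, locus_len, read_len=400)
        print(f"locus {locus_len} bp: P(hit) = {p:.2f}")
```

Under this model the chance of tagging a locus rises with locus length, which is why low-coverage survey sequencing can still detect most genes even when well under half of all bases are covered.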
The present study revealed the gene content and potential gene functions on this chromosome arm and demonstrated numerous sequence elements like SSRs and gene-related sequences, which can be utilised for future research as well as in breeding of wheat and rye.
A protein named AAH was isolated from the bacterium Microbacterium arborescens SE14, a gut commensal of lepidopteran larvae. It showed not only a high sequence similarity to Dps-like proteins (DNA-binding proteins from starved cells) but also reversible hydrolase activity. A comparative genomic analysis was performed to gain more insight into its evolution. The GC profile of the aah gene indicated that it evolved from a low-GC ancestor. Its stop codon usage also differed from the general pattern of actinobacterial genomes. The phylogeny of Dps-like proteins showed a strong correlation with the phylogeny of the host bacteria. A conserved genomic synteny was identified in some taxonomically related Actinobacteria, suggesting that the ancestral genes had been incorporated into the genome before the divergence of Micrococcineae from other families. The aah gene has evolved a new function but still retains the typical dodecameric structure.
The naked mole-rat (Heterocephalus glaber) is a long-lived, cancer-resistant rodent, and there is great interest in identifying the adaptations responsible for these and other unique traits. We employed RNA sequencing to compare liver gene expression profiles between naked mole-rats and wild-derived mice. Our results indicate that genes associated with oxidoreduction and mitochondria were expressed at higher relative levels in naked mole-rats. The largest effect is nearly 300-fold higher expression of epithelial cell adhesion molecule (Epcam), a tumour-associated protein. Also of interest are the protease inhibitor alpha2-macroglobulin (A2m) and the mitochondrial complex II subunit Sdhc, both ageing-related genes found strongly over-expressed in the naked mole-rat. These results hint at possible candidates for specifying species differences in ageing and cancer, and in particular suggest complex alterations in mitochondrial and oxidation-reduction pathways in the naked mole-rat. Our differential gene expression analysis obviated the need for a reference naked mole-rat genome by employing a combination of Illumina/Solexa and 454 platforms for transcriptome sequencing and assembling transcriptome contigs of the non-sequenced species. Overall, our work provides new research foci and methods for studying the naked mole-rat's fascinating characteristics.
Next generation sequencing of BACs is a viable option for deciphering the sequence of even large and highly repetitive genomes. In order to optimize this strategy, we examined the influence of read length on the quality of Roche/454 sequence assemblies, to what extent Illumina/Solexa mate pairs (MPs) improve the assemblies by scaffolding and whether barcoding of BACs is dispensable.
Sequencing four BACs with both FLX and Titanium technologies revealed similar sequencing accuracy, but showed that the longer Titanium reads produce considerably fewer misassemblies and gaps. The 454 assemblies of 96 barcoded BACs were improved by scaffolding 79% of the total contig length with MPs from a non-barcoded library.
Assembly of the unmasked 454 sequences without separation by barcodes revealed chimeric contig formation to be a major problem, encompassing 47% of the total contig length. Masking the sequences reduced this fraction to 24%.
Optimal BAC pool sequencing should be based on the longest available reads, with barcoding essential for a comprehensive assessment of both repetitive and non-repetitive sequence information. When interest is restricted to non-repetitive regions and repeats are masked prior to assembly, barcoding is non-essential. In any case, the assemblies can be improved considerably by scaffolding with non-barcoded BAC pool MPs.
BAC pools; next generation sequencing; 454; Illumina; barcoding; mate pairs; scaffolding; barley
In HIV infection, TLR7-triggered IFN-α production exerts a direct antiviral effect through the inhibition of viral replication, but may also be involved in immune pathogenesis leading to AIDS. TLR7 could also be an important mediator of vaccine efficacy. In this study, we analyzed polymorphisms in the X-linked TLR7 gene in the rhesus macaque model of AIDS. Upon resequencing of the TLR7 gene in 36 rhesus macaques of Indian origin, 12 polymorphic sites were detected. Next, we identified three tightly linked single nucleotide polymorphisms (SNPs) as being associated with survival time. Genotyping of 119 untreated, simian immunodeficiency virus (SIV)-infected male rhesus macaques, including an ‘MHC adjusted’ subset, revealed that the three TLR7 SNPs are also significantly associated with set-point viral load. Surprisingly, this effect was not observed in 72 immunized SIV-infected male monkeys. We hypothesize (i) that SNP c.13G>A in the leader peptide is causative for the observed genotype-phenotype association and (ii) that the underlying mechanism is related to RNA secondary structure formation. Therefore, we investigated a fourth SNP (c.-17C>T), located 17 bp upstream of the ATG translation initiation codon, that is also potentially capable of influencing RNA structure. In c.13A carriers, neither set-point viral load nor survival time was related to the c.-17C>T genotype. In c.13G carriers, by contrast, the c.-17C allele was significantly associated with prolonged survival. Again, no such association was detected among immunized SIV-infected macaques. Our results highlight the dual role of TLR7 in immunodeficiency virus infection and vaccination and imply that it may be important to control human AIDS vaccine trials not only for MHC genotype, but also for TLR7 genotype.
Cytosine methylation provides an epigenetic level of cellular plasticity that is important for development, differentiation and carcinogenesis. We adapted microdroplet PCR to bisulfite-treated target DNA in combination with second-generation sequencing to simultaneously assess DNA sequence and methylation. We show measurement of methylation status in a wide range of target sequences (total 34 kb) with an average coverage of 95% (median 100%) and good correlation to the opposite strand (rho = 0.96) and to pyrosequencing (rho = 0.87). Data from lymphoma and colorectal cancer samples for SNRPN (imprinted gene), FGF6 (demethylated in the cancer samples) and HS3ST2 (methylated in the cancer samples) serve as a proof of principle showing the integration of SNP data and phased DNA-methylation information into “hepitypes” and thus the analysis of DNA methylation phylogeny in the somatic evolution of cancer.
Melanin-concentrating hormone receptor 1 (MCHR1) plays a significant role in the regulation of energy balance, food intake, physical activity and body weight in humans and rodents. Several association studies for human obesity showed contradictory results concerning the SNPs rs133072 (G/A) and rs133073 (T/C), which localize to the first exon of MCHR1. The variations constitute two main haplotypes (GT, AC). Both SNPs affect CpG dinucleotides, whereby each haplotype contains a potential methylation site at one of the two SNP positions. In addition, 15 CpGs in close vicinity of these SNPs constitute a weak CpG island. Here, we studied whether DNA methylation in this sequence context may contribute to population- and age-specific effects of MCHR1 alleles in obesity.
We analyzed DNA methylation of a 315 bp region of MCHR1 encompassing rs133072 and rs133073 and the CpG island in blood samples of 49 individuals by bisulfite sequencing. The AC haplotype shows a significantly higher methylation level than the GT haplotype. This allele-specific methylation is age-dependent. In young individuals (20–30 years) the difference in DNA methylation between haplotypes is significant, whereas in individuals older than 60 years it is not detectable. Interestingly, the GT allele shows a decrease in methylation status with increasing BMI, whereas the methylation of the AC allele is not associated with this phenotype. Heterozygous lymphoblastoid cell lines show the same pattern of allele-specific DNA methylation. The cell line that exhibits the largest difference in methylation levels between the two haplotypes also shows allele-specific transcription of MCHR1, which can be abolished by treatment with the DNA methylase inhibitor 5-aza-2′-deoxycytidine.
We show that DNA methylation at MCHR1 is allele-specific, age-dependent, BMI-associated and affects transcription. Conceivably, this epigenetic regulation contributes to the age- and/or population specific effects reported for MCHR1 in several human obesity studies.
In highly copy number variable (CNV) regions such as the human defensin gene locus, comprehensive assessment of sequence variations is challenging. PCR approaches are practically restricted to tiny fractions, and next-generation sequencing (NGS) of whole individual genomes, e.g. by the 1000 Genomes Project, is limited by the affordable sequencing depth. Combining target enrichment with NGS may represent a feasible approach.
As a proof of principle, we enriched a ~850 kb section comprising the CNV defensin gene cluster DEFB, the invariable DEFA part and 11 control regions from two genomes by sequence capture and sequenced it by 454 technology. 6,651 differences to the human reference genome were found. Comparison to HapMap genotypes revealed sensitivities and specificities in the range of 94% to 99% for the identification of variations.
Using error probabilities for rigorous filtering revealed 2,886 unique single nucleotide variations (SNVs) including 358 putative novel ones. DEFB CN determinations by haplotype ratios were in agreement with alternative methods.
Although currently labor-intensive and costly, target-enriched NGS provides a powerful tool for the comprehensive assessment of SNVs in highly polymorphic CNV regions of individual genomes. Furthermore, it reveals considerable numbers of putative novel variations and simultaneously allows CN estimation.
Millions of humans and animals suffer from superficial infections caused by a group of highly specialized filamentous fungi, the dermatophytes, which exclusively infect keratinized host structures. To provide broad insights into the molecular basis of the pathogenicity-associated traits, we report the first genome sequences of two closely phylogenetically related dermatophytes, Arthroderma benhamiae and Trichophyton verrucosum, both of which induce highly inflammatory infections in humans.
97% of the 22.5 megabase genome sequences of A. benhamiae and T. verrucosum are unambiguously alignable and collinear. To unravel dermatophyte-specific virulence-associated traits, we compared sets of potentially pathogenicity-associated proteins, such as secreted proteases and enzymes involved in secondary metabolite production, with those of closely related onygenales (Coccidioides species) and the mould Aspergillus fumigatus. The comparisons revealed expansion of several gene families in dermatophytes and disclosed the peculiarities of the dermatophyte secondary metabolite gene sets. Secretion of proteases and other hydrolytic enzymes by A. benhamiae was proven experimentally by a global secretome analysis during keratin degradation. Molecular insights into the interaction of A. benhamiae with human keratinocytes were obtained for the first time by global transcriptome profiling. Given that A. benhamiae is able to undergo mating, a detailed comparison of the genomes further unraveled the genetic basis of sexual reproduction in this species.
Our results shed light on the genetic basis of fundamental and putatively virulence-related traits of dermatophytes, advancing future research on these medically important pathogens.
The innate immune system employs several receptor families that form the basis of sensing pathogen-associated molecular patterns. NOD (nucleotide-binding and oligomerization domain) like receptors (NLRs) comprise a group of cytosolic proteins that trigger protective responses upon recognition of intracellular danger signals. NOD2 displays a tandem caspase recruitment domain (CARD) architecture, which is unique within the NLR family.
Here, we report a novel alternative transcript of the NOD2 gene, which codes for a truncated tandem-CARD-only protein, called NOD2-C2. The transcript isoform is most highly expressed in leucocytes, a natural barrier against pathogen invasion, and is strictly linked to promoter usage as well as predominantly to one allele of the single nucleotide polymorphism rs2067085. Contrary to a previously identified truncated single-CARD NOD2 isoform, NOD2-S, NOD2-C2 is able to activate NF-κB in a dose-dependent manner independently of muramyl dipeptide (MDP). On the other hand, NOD2-C2 competes with MDP's ability to activate the NOD2-driven NF-κB signaling cascade.
NOD2 transcripts that include an alternative exon downstream of exon 3 (exon 3a) are the endogenous equivalents of a previously described in vitro construct with the putative protein composed of only the two N-terminal CARDs. This protein form (NOD2-C2) activates NF-κB independently of an MDP stimulus and is a potential regulator of NOD2 signaling.
Subtle alternative splicing events involving tandem splice sites separated by a short (2-12 nucleotides) distance are frequent and evolutionarily widespread in eukaryotes, and a major contributor to the complexity of transcriptomes and proteomes. However, these events have been either omitted altogether in databases on alternative splicing, or only the cases of experimentally confirmed alternative splicing have been reported. Thus, a database that covers all confirmed cases of subtle alternative splicing as well as the numerous putative tandem splice sites (which might be confirmed once more transcript data becomes available), and that allows users to search for tandem splice sites with specific features and download the results, is a valuable resource for targeted experimental studies and large-scale bioinformatics analyses of tandem splice sites. Towards this goal we recently set up TassDB (Tandem Splice Site DataBase, version 1), which stores data about alternative splicing events at tandem splice sites separated by 3 nt in eight species.
We have substantially revised and extended TassDB. The currently available version 2 contains extensive information about tandem splice sites separated by 2-12 nt for the human and mouse transcriptomes including data on the conservation of the tandem motifs in five vertebrates. TassDB2 offers a user-friendly interface to search for specific genes or for genes containing tandem splice sites with specific features as well as the possibility to download result datasets. For example, users can search for cases of alternative splicing where the proportion of EST/mRNA evidence supporting the minor isoform exceeds a specific threshold, or where the difference in splice site scores is specified by the user. The predicted impact of each event on the protein is also reported, along with information about being a putative target for the nonsense-mediated decay (NMD) pathway. Links are provided to the UCSC genome browser and other external resources.
TassDB2, available via http://www.tassdb.info, provides comprehensive resources for researchers interested in both targeted experimental studies and large-scale bioinformatics analyses of short distance tandem splice sites.
Alternative splicing (AS) involving tandem acceptors that are separated by three nucleotides (NAGNAG) is an evolutionarily widespread class of AS, which is well studied in Homo sapiens (human) and Mus musculus (mouse). It has also been shown to be common in the model seed plants Arabidopsis thaliana and Oryza sativa (rice). In one of the first studies involving sequence-based prediction of AS in plants, we performed a genome-wide identification and characterization of NAGNAG AS in the model plant Physcomitrella patens, a moss.
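To make the motif definition concrete, the sketch below scans a nucleotide string for NAGNAG patterns, where N is any nucleotide. It is a plain pattern search under stated assumptions: the function name `find_nagnag` is hypothetical, and the scan does not check that a match actually sits at an annotated intron-exon boundary, which a genuine tandem-acceptor analysis would require.

```python
import re

# N = any of A, C, G, T. The lookahead lets overlapping motifs be reported,
# e.g. CAGCAGCAG contains two NAGNAG occurrences.
NAGNAG = re.compile(r"(?=([ACGT]AG[ACGT]AG))")


def find_nagnag(seq):
    """Return (0-based start, motif) for every NAGNAG occurrence in seq."""
    return [(m.start(), m.group(1)) for m in NAGNAG.finditer(seq.upper())]


if __name__ == "__main__":
    # Toy sequence for illustration only, not real genomic data.
    for start, motif in find_nagnag("ttcagcagGT"):
        print(start, motif)
```

In a real pipeline the candidate positions would be restricted to annotated acceptor sites before any transcript evidence is counted.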
Using Sanger data, we found 295 alternatively used NAGNAG acceptors in P. patens. Using 31 features and training and test datasets of constitutive and alternative NAGNAGs, we trained a classifier to predict the splicing outcome at NAGNAG tandem splice sites (alternative splicing, constitutive at the first acceptor, or constitutive at the second acceptor). Our classifier achieved a balanced specificity and sensitivity of ≥ 89%. Subsequently, a classifier trained exclusively on data well supported by transcript evidence was used to make genome-wide predictions of NAGNAG splicing outcomes. By generation of more transcript evidence from a next-generation sequencing platform (Roche 454), we found additional evidence for NAGNAG AS, with altogether 664 alternative NAGNAGs being detected in P. patens using all currently available transcript evidence. The 454 data also enabled us to validate the predictions of the classifier, with 64% (80/125) of the well-supported cases of AS being predicted correctly.
NAGNAG AS is just as common in the moss P. patens as it is in the seed plants A. thaliana and O. sativa (but not conserved on the level of orthologous introns), and can be predicted with high accuracy. The most informative features are the nucleotides in the NAGNAG and in its immediate vicinity, along with the splice site scores, as found earlier for NAGNAG AS in animals. Our results suggest that the mechanism behind NAGNAG AS in plants is similar to that in animals and is largely dependent on the splice site and its immediate neighborhood.
The beta-defensin gene cluster (DEFB) at chromosome 8p23.1 is one of the most copy number (CN) variable regions of the human genome. Whereas individual DEFB CNs have been suggested as independent genetic risk factors for several diseases (e.g. psoriasis and Crohn's disease), the role of multisite sequence variations (MSV) is less well understood and to date has only been reported for prostate cancer. Simultaneous assessment of MSVs and CNs can be achieved by PCR, cloning and Sanger sequencing; however, these methods are labour- and cost-intensive as well as prone to methodological bias introduced by bacterial cloning. Here, we demonstrate that amplicon sequencing of pooled individual PCR products by the 454 technology allows in-depth determination of MSV haplotypes and estimation of DEFB CNs in parallel.
Six PCR products spread over ~87 kb of DEFB and harbouring 24 known MSVs were amplified from 11 DNA samples, pooled and sequenced on a Roche 454 GS FLX sequencer. From ~142,000 reads, ~120,000 haplotype calls (HC) were inferred that identified 22 haplotypes ranging from 2 to 7 per amplicon. In addition to the 24 known MSVs, two additional sequence variations were detected. Minimal CNs were estimated from the ratio of HCs and compared to absolute CNs determined by alternative methods. Concordance in CNs was found for 7 samples, the CNs differed by one in 2 samples and the estimated minimal CN was half of the absolute in one sample. For 7 samples and 2 amplicons, the 454 haplotyping results were compared to those by cloning/Sanger sequencing. Intrinsic problems related to chimera formation during PCR and differences between haplotyping by 454 and cloning/Sanger sequencing are discussed.
Deep amplicon sequencing using the 454 technology yields thousands of HCs per amplicon for an affordable price and may represent an effective method for parallel haplotyping and CN estimation in small to medium-sized cohorts. The obtained haplotypes represent a valuable resource to facilitate further studies of the biomedical impact of highly CN variable loci such as the beta-defensin locus.
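The estimation of a minimal CN from the ratio of haplotype calls can be sketched as follows. This is an assumed procedure for illustration only, not the one used in the study: it searches for the smallest total copy number at which every haplotype's share of the calls is close to a whole number of copies. The function name `minimal_copy_number`, the tolerance `tol` and the upper bound `max_cn` are hypothetical.

```python
def minimal_copy_number(hap_counts, max_cn=12, tol=0.2):
    """Smallest total CN whose integer split matches the haplotype call ratios.

    hap_counts maps haplotype name -> number of haplotype calls (assumed input).
    Returns (total_cn, copies_per_haplotype), or None if nothing fits.
    """
    total = sum(hap_counts.values())
    if total <= 0:
        raise ValueError("no haplotype calls")
    fractions = {h: c / total for h, c in hap_counts.items()}

    # Each observed haplotype needs at least one copy, so start there.
    for n in range(len(fractions), max_cn + 1):
        copies = {h: round(f * n) for h, f in fractions.items()}
        if min(copies.values()) < 1 or sum(copies.values()) != n:
            continue
        # Accept n only if every haplotype is near an integer copy count.
        if all(abs(f * n - copies[h]) <= tol for h, f in fractions.items()):
            return n, copies
    return None


if __name__ == "__main__":
    # Illustrative counts, not data from the study.
    print(minimal_copy_number({"hapA": 200, "hapB": 100, "hapC": 100}))
```

Because identical copies cannot be told apart by their sequence, a sample carrying only one haplotype yields a minimal CN of 1 whatever its true CN, which is why such an estimate is a lower bound.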
The aim of the study was to resolve the genetic etiology in families having inherited cataracts.
Families afflicted with congenital/childhood cataracts were registered in Chennai and Orissa (India). Blood samples were collected from the probands and available family members. Selected functional candidate genes were amplified by polymerase chain reaction (PCR) and characterized by direct sequencing. Putative mutations were confirmed in healthy controls.
We observed new, ethnically specific polymorphisms, some of them frequent, such as a 3-bp deletion in intron 3 of CRYBB2 (encoding βB2-crystallin) and the IVS1+9 c>t variation in HSF4 (encoding heat-shock factor 4). Some rare single nucleotide polymorphisms (SNPs) co-segregate with the respective phenotype, such as IVS3+120c>a of CRYBB2, whereas M44V of CRYGD (encoding γD-crystallin), although found in association with blue dot opacity, was also seen in a few healthy controls. We identified two new mutations that co-segregate with the respective cataract phenotype within the families and were not seen in healthy controls from India or Germany. Both are missense mutations: one in GJA3 (encoding gap junction protein α3, also referred to as connexin 46) affects codon 19 (T19M), with a posterior-polar cataract as the corresponding phenotype; the other affects CRYBB2 (W59C; total cataract). Additionally, a cDNA variation (G54A) identified in a zonular cataract affects a highly conserved splice site of CRYBB2. This mutation, however, showed reduced penetrance in the family, which might be explained by different molecular consequences in the affected family members: nonsense-mediated decay of the mutated mRNA might cause no clinical phenotype in heterozygotes, whereas translation of the mutated mRNA is predicted to lead to a small hybrid protein (consisting of 16 amino acids of βB2-crystallin and 18 new amino acids), which might have a dominant-negative function in the lens.
This report identifies, in families with childhood cataract, some new alleles that may be considered causative for cataracts. Furthermore, we report some geographically restricted rare polymorphic sites, which might act in some contexts as modifiers or as alleles sensitizing the ocular lens toward cataractogenesis.
De novo sequencing of a large, complex plant genome such as that of barley (Hordeum vulgare L.) is a major challenge in terms of both experimental feasibility and cost. The emergence and rapid progress of next-generation sequencing technologies has put this goal into focus, and a clone-based strategy combined with the 454/Roche technology is conceivable.
To test the feasibility, we sequenced 91 barcoded, pooled, gene-containing barley BACs using the GS FLX platform and assembled the sequences under iterative change of parameters. The BAC assemblies were characterized by an N50 of ~50 kb (N80 ~31 kb, N90 ~21 kb) and a Q40 of 94%. For ~80% of the clones, the best assemblies consisted of fewer than 10 contigs at 24-fold mean sequence coverage. Moreover, we show that gene-containing regions seem to assemble completely and without interruption, making the approach suitable for detecting complete and positionally anchored genes.
By comparing the assemblies of four clones to their complete reference sequences generated by the Sanger method, we evaluated the distribution, quality and representativeness of the 454 sequences as well as the consistency and reliability of the assemblies.
The described multiplex 454 sequencing of barcoded BACs leads to consensus sequences that are highly representative of the clones. Assemblies are correct for the majority of contigs. Though the resolution of complex repetitive structures requires additional experimental efforts, our approach paves the way for a clone-based strategy of sequencing the barley genome.
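The N50, N80 and N90 values used above to characterize the BAC assemblies follow the common definition of the Nx statistic: the length of the shortest contig among the longest contigs that together cover at least x% of the total assembly length. A minimal sketch, assuming contig lengths are available as a plain list (the function name `nx` is hypothetical and the example lengths are illustrative, not data from the study):

```python
def nx(contig_lengths, x=50):
    """Return the Nx statistic for a list of contig lengths.

    Contigs are sorted from longest to shortest and accumulated until they
    cover at least x percent of the total length; the length of the contig
    that reaches this threshold is the Nx value.
    """
    if not contig_lengths:
        raise ValueError("no contigs given")
    if not 0 < x <= 100:
        raise ValueError("x must be in (0, 100]")

    lengths = sorted(contig_lengths, reverse=True)
    threshold = sum(lengths) * x / 100
    running = 0
    for length in lengths:
        running += length
        if running >= threshold:
            return length
    return lengths[-1]  # only reachable through floating-point edge cases


if __name__ == "__main__":
    contigs_kb = [80, 70, 50, 40, 30, 20, 10]  # illustrative lengths in kb
    print(nx(contigs_kb, 50), nx(contigs_kb, 80), nx(contigs_kb, 90))
```

Since higher x values require more of the assembly to be covered, N80 and N90 can never exceed N50 for the same set of contigs, which matches the ordering of the values reported above.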
Alternative splicing (AS) involving NAGNAG tandem acceptors is an evolutionarily widespread class of AS. Recent predictions of alternative acceptor usage reported better results for acceptors separated by larger distances than for NAGNAGs. To improve the latter, we aimed to use Bayesian networks (BN) and to validate the predictions extensively by experiment. Using carefully constructed training and test datasets, a balanced sensitivity and specificity of ≥92% was achieved. A BN trained on the combined dataset was then used to make predictions, and 81% (38/47) of the experimentally tested predictions were verified. Applying a BN learned on human data to six other genomes, we show that while the performance for the vertebrate genomes matches that achieved on human data, there is a slight drop for Drosophila and worm. Lastly, using the prediction accuracy according to experimental validation, we estimate the number of yet undiscovered alternative NAGNAGs. State-of-the-art classifiers can produce highly accurate predictions of AS at NAGNAGs, indicating that we have identified the major features of the ‘NAGNAG-splicing code’ within the splice site and its immediate neighborhood. Our results suggest that the mechanism behind NAGNAG AS is simple, stochastic, and conserved among vertebrates and beyond.
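The performance figures quoted above rest on standard definitions, which the following minimal sketch spells out. The function names are hypothetical, and the confusion-matrix counts in the usage example are invented for illustration; only the 38 verified out of 47 tested predictions come from the text.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of truly alternative NAGNAGs that are predicted as alternative."""
    if true_pos + false_neg == 0:
        raise ValueError("no positive cases")
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of truly constitutive NAGNAGs that are predicted as constitutive."""
    if true_neg + false_pos == 0:
        raise ValueError("no negative cases")
    return true_neg / (true_neg + false_pos)


def validation_rate(verified, tested):
    """Fraction of experimentally tested predictions that were verified."""
    if tested == 0:
        raise ValueError("no predictions tested")
    return verified / tested


if __name__ == "__main__":
    # Hypothetical counts chosen only to illustrate the calculation.
    print(f"sensitivity = {sensitivity(92, 8):.2f}")
    print(f"specificity = {specificity(46, 4):.2f}")
    # Counts reported in the text: 38 of 47 tested predictions verified.
    print(f"validation rate = {validation_rate(38, 47):.0%}")
```

A classifier is usually described as balanced when its decision threshold is set so that sensitivity and specificity are approximately equal.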
The human X chromosome has a unique biology that was shaped by its evolution as the sex chromosome shared by males and females. We have determined 99.3% of the euchromatic sequence of the X chromosome. Our analysis illustrates the autosomal origin of the mammalian sex chromosomes, the stepwise process that led to the progressive loss of recombination between X and Y, and the extent of subsequent degradation of the Y chromosome. LINE1 repeat elements cover one-third of the X chromosome, with a distribution that is consistent with their proposed role as way stations in the process of X-chromosome inactivation. We found 1,098 genes in the sequence, of which 99 encode proteins expressed in testis and in various tumour types. A disproportionately high number of mendelian diseases are documented for the X chromosome. Of this number, 168 have been explained by mutations in 113 X-linked genes, which in many cases were characterized with the aid of the DNA sequence.