1.  A novel statistical method to estimate the effective SNP size in vertebrate genomes and categorized genomic regions 
BMC Genomics  2006;7:329.
The local environment of single nucleotide polymorphisms (SNPs) contains abundant genetic information for the study of mechanisms of mutation, genome evolution, and causes of diseases. Recent studies revealed that neighboring-nucleotide biases on SNPs were strong and the genome-wide bias patterns could be represented by a small subset of the total SNPs. However, estimating the effective SNP size, that is, the number of SNPs sufficient to represent the bias patterns observed in the whole SNP data, remains an unsolved problem.
To estimate the effective SNP size, we developed a novel statistical method, SNPKS, which considers both statistical and biological significance. SNPKS consists of two major steps: obtaining an initial effective size with the Kolmogorov-Smirnov test (KS test) and finding an intermediate effective size by interval evaluation. The SNPKS algorithm was implemented in computer programs and applied to real SNP data. The effective SNP size was estimated to be 38,200, 39,300, 38,000, and 38,700 in the human, chimpanzee, dog, and mouse genomes, respectively, and 39,100, 39,600, 39,200, and 42,200 in human intergenic, genic, intronic, and CpG island regions, respectively.
SNPKS is the first statistical method to estimate the effective SNP size. It runs efficiently and greatly outperforms the algorithm implemented in SNPNB. The application of SNPKS to real SNP data revealed a similarly small effective SNP size (38,000 – 42,200) in the human, chimpanzee, dog, and mouse genomes as well as in human genomic regions. The findings suggest a strong influence of genetic factors across vertebrate genomes.
PMCID: PMC1769377  PMID: 17196097
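The first SNPKS step described above relies on the Kolmogorov-Smirnov test. The following is a minimal sketch of a two-sample KS statistic and of a search for the smallest subsample that stays close to the full data. It is not the SNPKS implementation: the function names, the use of a prefix of pre-shuffled data as the subsample, and the `threshold` cutoff are all assumptions made for illustration, and the interval-evaluation step is omitted.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the two empirical cumulative distribution functions (ECDFs)."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # The ECDFs only change at observed values, so checking those suffices.
    for x in set(a) | set(b):
        fa = bisect_right(a, x) / len(a)
        fb = bisect_right(b, x) / len(b)
        d = max(d, abs(fa - fb))
    return d


def smallest_matching_size(full, sizes, threshold):
    """Return the smallest candidate size whose subsample has a KS distance
    to the full data at or below `threshold` (a hypothetical cutoff), or
    None if no candidate qualifies. `full` is assumed to be pre-shuffled,
    so a prefix acts as a random subsample."""
    for n in sorted(sizes):
        if ks_statistic(full[:n], full) <= threshold:
            return n
    return None
```

In practice the values compared would be some per-SNP measure of neighboring-nucleotide bias, and a p-value rather than a raw distance cutoff would normally decide whether a subsample is acceptable.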
2.  Array-CGH and multipoint FISH to decode complex chromosomal rearrangements 
BMC Genomics  2006;7:330.
Recently, several high-resolution methods of chromosome analysis have been developed. It is important to compare these methods and to select reliable combinations of techniques to analyze complex chromosomal rearrangements in tumours. In this study we have compared array-CGH (comparative genomic hybridization) and multipoint FISH (mpFISH) for their ability to characterize complex rearrangements on human chromosome 3 (chr3) in tumour cell lines. We have used 179 BAC/PAC clones covering chr3 with an approximately 1 Mb resolution to analyze nine carcinoma lines. Chr3 was chosen for analysis, because of its frequent rearrangements in human solid tumours.
The ploidy of the tumour cell lines ranged from near-diploid to near-pentaploid. Chr3 locus copy number was assessed by interphase and metaphase mpFISH. In total, 53 chr3 fragments were identified, with copy numbers ranging from 0 to 14. The mpFISH results from the BAC/PAC clones largely agreed with those from array-CGH. Each copy number change on the array profile could be related to a specific chromosome aberration detected by metaphase mpFISH. The analysis of the correlation between real copy number from mpFISH and the average normalized inter-locus fluorescence ratio (ANILFR) value detected by array-CGH demonstrated that copy number is a linear function of parameters that include the variable, ANILFR, and two constants, ploidy and background normalized fluorescence ratio.
In most cases, the changes in copy number seen on array-CGH profiles reflected cumulative chromosome rearrangements. Most of them stemmed from unbalanced translocations. Although our chr3 BAC/PAC array could identify single copy number changes even in pentaploid cells, mpFISH provided a more accurate analysis in the dissection of complex karyotypes at high ploidy levels.
PMCID: PMC1769374  PMID: 17196103
3.  Comprehensive analysis of alternative splicing in rice and comparative analyses with Arabidopsis 
BMC Genomics  2006;7:327.
Recently, genomic sequencing efforts were finished for Oryza sativa (cultivated rice) and Arabidopsis thaliana (Arabidopsis). Additionally, these two plant species have extensive cDNA and expressed sequence tag (EST) libraries. We employed the Program to Assemble Spliced Alignments (PASA) to identify and analyze alternatively spliced isoforms in both species.
A comprehensive analysis of alternative splicing was performed in rice that started with >1.1 million publicly available spliced ESTs and over 30,000 full length cDNAs in conjunction with the newly enhanced PASA software. A parallel analysis was performed with Arabidopsis to compare and ascertain potential differences between monocots and dicots. Alternative splicing is a widespread phenomenon (observed in greater than 30% of the loci with transcript support) and we have described nine alternative splicing variations. While alternative splicing has the potential to create many RNA isoforms from a single locus, the majority of loci generate only two or three isoforms and transcript support indicates that these isoforms are generally not rare events. For the alternate donor (AD) and acceptor (AA) classes, the distance between the splice sites for the majority of events was found to be less than 50 basepairs (bp). In both species, the most frequent distance between AA is 3 bp, consistent with reports in mammalian systems. Conversely, the most frequent distance between AD is 4 bp in both plant species, as previously observed in mouse. Most alternative splicing variations are localized to the protein coding sequence and are predicted to significantly alter the coding sequence.
Alternative splicing is widespread in both rice and Arabidopsis and these species share many common features. Interestingly, alternative splicing may play a role beyond creating novel combinations of transcripts that expand the proteome. Many isoforms will presumably have negative consequences for protein structure and function, suggesting that their biological role involves post-transcriptional regulation of gene expression.
PMCID: PMC1769492  PMID: 17194304
4.  Linkage disequilibrium of evolutionarily conserved regions in the human genome 
BMC Genomics  2006;7:326.
The strong linkage disequilibrium (LD) recently found in genic or exonic regions of the human genome demonstrated that LD can be increased by evolutionary mechanisms that select for functionally important loci. This suggests that LD might be stronger in regions conserved among species than in non-conserved regions, since regions exposed to natural selection tend to be conserved. To assess this hypothesis, we used genome-wide polymorphism data from the HapMap project and investigated LD within DNA sequences conserved between the human and mouse genomes.
Unexpectedly, we observed that LD was significantly weaker in conserved regions than in non-conserved regions. To investigate why, we examined sequence features that may distort the relationship between LD and conserved regions. We found that interspersed repeats, and not other sequence features, were associated with the weak LD tendency in conserved regions. To appropriately understand the relationship between LD and conserved regions, we removed the effect of repetitive elements and found that the high degree of sequence conservation was strongly associated with strong LD in coding regions but not with that in non-coding regions.
Our work demonstrates that the degree of sequence conservation does not simply increase LD as predicted by the hypothesis. Rather, it implies that purifying selection changes the polymorphic patterns of coding sequences but has little influence on the patterns of functional units such as regulatory elements present in non-coding regions, since the former are generally restricted by the constraint of maintaining a functional protein product across multiple exons while the latter may exist more as individually isolated units.
PMCID: PMC1769491  PMID: 17192199
5.  Detection of transcriptional difference of porcine imprinted genes using different microarray platforms 
BMC Genomics  2006;7:328.
Presently, multiple options exist for conducting gene expression profiling studies in swine. In order to determine the performance of some of the existing microarrays, Affymetrix Porcine, Affymetrix Human U133+2.0, and the U.S. Pig Genome Coordination Program spotted glass oligonucleotide microarrays were compared for their reproducibility, coverage, platform independent and dependent sensitivity using fibroblast cell lines derived from control and parthenogenic porcine embryos.
Array group correlations between technical replicates demonstrated comparable reproducibility in both Affymetrix arrays. Glass oligonucleotide arrays showed greater variability and, in addition, approximately 10% of probes had to be discarded due to slide printing defects. Probe level analysis of Affymetrix Human arrays revealed significant variability within probe sets due to the effects of cross-species hybridization. Affymetrix Porcine arrays identified the greatest number of differentially expressed genes amongst probes common to all arrays, a measure of platform sensitivity. Affymetrix Porcine arrays also identified the greatest number of differentially expressed known imprinted genes using all probes on each array, an ad hoc measure of realistic performance for this particular experiment.
We conclude that of the platforms currently available and tested, the Affymetrix Porcine array is the most sensitive and reproducible microarray for swine genomic studies.
PMCID: PMC1769376  PMID: 17194308
6.  Alternative splicing and differential gene expression in colon cancer detected by a whole genome exon array 
BMC Genomics  2006;7:325.
Alternative splicing is a mechanism for increasing protein diversity by excluding or including exons during post-transcriptional processing. Alternatively spliced proteins are particularly relevant in oncology since they may contribute to the etiology of cancer, provide selective drug targets, or serve as a marker set for cancer diagnosis. While conventional identification of splice variants generally targets individual genes, we present here a new exon-centric array (GeneChip Human Exon 1.0 ST) that allows genome-wide identification of differential splice variation, and concurrently provides a flexible and inclusive analysis of gene expression.
We analyzed 20 paired tumor-normal colon cancer samples using a microarray designed to detect over one million putative exons that can be virtually assembled into potential gene-level transcripts according to various levels of prior supporting evidence. Analysis of high confidence (empirically supported) transcripts identified 160 differentially expressed genes, with 42 genes occupying a network impacting cell proliferation and another twenty nine genes with unknown functions. A more speculative analysis, including transcripts based solely on computational prediction, produced another 160 differentially expressed genes, three-fourths of which have no previous annotation. We also present a comparison of gene signal estimations from the Exon 1.0 ST and the U133 Plus 2.0 arrays.
Novel splicing events were predicted by experimental algorithms that compare the relative contribution of each exon to the cognate transcript intensity in each tissue. The resulting candidate splice variants were validated with RT-PCR. We found nine genes that were differentially spliced between colon tumors and normal colon tissues, several of which have not been previously implicated in cancer. Top scoring candidates from our analysis were also found to substantially overlap with EST-based bioinformatic predictions of alternative splicing in cancer.
Differential expression of high confidence transcripts correlated extremely well with known cancer genes and pathways, suggesting that the more speculative transcripts, largely based solely on computational prediction and mostly with no previous annotation, might be novel targets in colon cancer. Five of the identified splicing events affect mediators of cytoskeletal organization (ACTN1, VCL, CALD1, CTTN, TPM1), two affect extracellular matrix proteins (FN1, COL6A3) and another participates in integrin signaling (SLC3A2). Altogether they form a pattern of colon-cancer specific alterations that may particularly impact cell motility.
PMCID: PMC1769375  PMID: 17192196
7.  SIGMA: A System for Integrative Genomic Microarray Analysis of Cancer Genomes 
BMC Genomics  2006;7:324.
The prevalence of high resolution profiling of genomes has created a need for the integrative analysis of information generated from multiple methodologies and platforms. Although the majority of data in the public domain are gene expression profiles, and expression analysis software is available, the increase of array CGH studies has enabled integration of high throughput genomic and gene expression datasets. However, tools for direct mining and analysis of array CGH data are limited. Hence, there is a great need for analytical and display software tailored to cross platform integrative analysis of cancer genomes.
We have created a user-friendly Java application to facilitate sophisticated visualization and analysis such as cross-tumor and cross-platform comparisons. To demonstrate the utility of this software, we assembled array CGH data representing Affymetrix SNP chip, Stanford cDNA arrays and whole genome tiling path array platforms for cross comparison. This cancer genome database contains 267 profiles from commonly used cancer cell lines representing 14 different tissue types.
In this study we have developed an application for the visualization and analysis of data from high resolution array CGH platforms that can be adapted for analysis of multiple types of high throughput genomic datasets. Furthermore, we invite researchers using array CGH technology to deposit both their raw and processed data, as this will be a continually expanding database of cancer genomes. This publicly available resource, the System for Integrative Genomic Microarray Analysis (SIGMA) of cancer genomes, can be accessed at .
PMCID: PMC1764892  PMID: 17192189
8.  Conservation of noncoding microsatellites in plants: implication for gene regulation 
BMC Genomics  2006;7:323.
Microsatellites are extremely common in plant genomes, and in particular, they are significantly enriched in the 5' noncoding regions. Although some 5' noncoding microsatellites involved in gene regulation have been described, the general properties of microsatellites as regulatory elements are still unknown. To address the question of microsatellites associated with regulatory elements, we have analyzed the conserved noncoding microsatellite sequences (CNMSs) in the 5' noncoding regions by inter- and intragenomic phylogenetic footprinting in the Arabidopsis and Brassica genomes.
We identified 247 Arabidopsis-Brassica orthologous and 122 Arabidopsis paralogous CNMSs, representing 491 CT/GA and CTT/GAA repeats, which accounted for 10.6% of these types located in the 500-bp regions upstream of coding sequences in the Arabidopsis genome. Among these identified CNMSs, 18 microsatellites show high conservation in the regulatory regions of both orthologous and paralogous genes, and some of them also appear in the corresponding positions of more distant homologs in Arabidopsis, as well as in other plants. A computational scan of CNMSs for known cis-regulatory elements showed that light responsive elements were clustered in the region of CT/GA repeats, as well as salicylic acid responsive elements in the (CTT)n/(GAA)n sequences. Patterns of gene expression revealed that 70–80% of CNMS (CTT)n/(GAA)n associated genes were regulated by salicylic acid, which was consistent with the prediction of regulatory elements in silico.
Our analyses showed that some noncoding microsatellites were conserved in plants and appeared to be ancient. These CNMSs served as regulatory elements involved in light and salicylic acid responses. Our findings might have implications in the common features of the over-represented microsatellites for gene regulation in plant-specific pathways.
PMCID: PMC1781443  PMID: 17187690
9.  The major histocompatibility complex (Mhc) class IIB region has greater genomic structural flexibility and diversity in the quail than the chicken 
BMC Genomics  2006;7:322.
The quail and chicken major histocompatibility complex (Mhc) genomic regions have a similar overall organization but differ markedly in that the quail has an expanded number of duplicated class I, class IIB, natural killer (NK)-receptor-like, lectin-like and BG genes. Therefore, the elucidation of genetic factors that contribute to the greater Mhc diversity in the quail would help to establish it as a model experimental animal in the investigation of avian Mhc associated diseases.
Aims and approaches
The main aim here was to characterize the genetic and genomic features of the transcribed major quail MhcIIB (CojaIIB) region that is located between the Tapasin and BRD2 genes, and to compare our findings to the available information for the chicken MhcIIB (BLB). We used four approaches in the study of the quail MhcIIB region, (1) haplotype analyses with polymorphic loci, (2) cloning and sequencing of the RT-PCR CojaIIB products from individuals with different haplotypes, (3) genomic sequencing of the CojaIIB region from the individuals with the different haplotypes, and (4) phylogenetic and duplication analysis to explain the variability of the region between the quail and the chicken.
Our results show that the Tapasin-BRD2 segment of the quail Mhc is highly variable in length and in gene transcription intensity and content. Haplotypic sequences were found to vary in length from 4 to 11 kb. Tapasin-BRD2 segments contain one or two major transcribed CojaIIBs that were probably generated by segmental duplications involving c-type lectin-like genes and NK receptor-like genes, gene fusions between two CojaIIBs and transpositions between the major and minor CojaIIB segments. The relative evolutionary speed for generating the MhcIIB genomic structures from the ancestral BLB2 was estimated to be two times faster in the quail than in the chicken after their separation from a common ancestor. Four types of genomic rearrangement elements (GRE), composed of simple tandem repeats (STR), were identified in the MhcIIB genomic segment located between the Tapasin and BRD2 genes. The GREs contain many more STRs in the quail than in the chicken, which displays strong linkage disequilibrium.
This study suggests that the Mhc class IIB region has a flexible genomic structure generated by rearrangement elements and rapid SNP accumulation, probably as a consequence of the quail adapting to environmental conditions and pathogens during its migratory history after its divergence from the chicken.
PMCID: PMC1769493  PMID: 17184537
10.  Quantitative analysis of cell-type specific gene expression in the green alga Volvox carteri 
BMC Genomics  2006;7:321.
The multicellular alga Volvox carteri possesses only two cell types: mortal, motile somatic cells and potentially immortal, immotile reproductive cells. It is therefore an attractive model system for studying how cell-autonomous cytodifferentiation is programmed within a genome. Moreover, there are ongoing genome projects both in Volvox carteri and in the closely related unicellular alga Chlamydomonas reinhardtii. However, gene sequencing is only the beginning. To identify cell-type specific expression and to determine relative expression rates, we evaluate the potential of real-time RT-PCR for quantifying gene transcript levels.
Here we analyze a diversified pool of 39 target genes by real-time RT-PCR for each cell type. This gene pool contains previously known genes with unknown localization of cellular expression, 28 novel genes which are described in this study for the first time, and a few known, cell-type specific genes as a control. The respective gene products are, for instance, part of photosynthesis, cellular regulation, stress response, or transport processes. We provide expression data for all these genes.
The results show that quantitative real-time RT-PCR is a favorable approach to analyze cell-type specific gene expression in Volvox, which can be extended to a much larger number of genes or to developmental or metabolic mutants. Our expression data also provide a basis for a detailed analysis of individual, previously unknown, cell-type specifically expressed genes.
PMCID: PMC1774577  PMID: 17184518
11.  Changes in skeletal muscle gene expression following clenbuterol administration 
BMC Genomics  2006;7:320.
Beta-adrenergic receptor agonists (BA) induce skeletal muscle hypertrophy, yet specific mechanisms that lead to this effect are not well understood. The objective of this research was to identify novel genes and physiological pathways that potentially facilitate BA induced skeletal muscle growth. The Affymetrix platform was utilized to identify gene expression changes in mouse skeletal muscle 24 hours and 10 days after administration of the BA clenbuterol.
Administration of clenbuterol stimulated anabolic activity, as indicated by decreased blood urea nitrogen (BUN; P < 0.01) and increased body weight gain (P < 0.05) 24 hours or 10 days, respectively, after initiation of clenbuterol treatment. A total of 22,605 probesets were evaluated with 52 probesets defined as differentially expressed based on a false discovery rate of 10%. Differential mRNA abundance of four of these genes was validated in an independent experiment by quantitative PCR. Functional characterization of differentially expressed genes revealed several categories that participate in biological processes important to skeletal muscle growth, including regulators of transcription and translation, mediators of cell-signalling pathways, and genes involved in polyamine metabolism.
Global evaluation of gene expression after administration of clenbuterol identified changes in gene expression and overrepresented functional categories of genes that may regulate BA-induced muscle hypertrophy. Changes in mRNA abundance of multiple genes associated with myogenic differentiation may indicate an important effect of BA on proliferation, differentiation, and/or recruitment of satellite cells into muscle fibers to promote muscle hypertrophy. Increased mRNA abundance of genes involved in the initiation of translation suggests that increased levels of protein synthesis often associated with BA administration may result from a general up-regulation of translational initiators. Additionally, numerous other genes and physiological pathways were identified that will be important targets for further investigations of the hypertrophic effect of BA on skeletal muscle.
PMCID: PMC1766935  PMID: 17181869
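The clenbuterol study above selects differentially expressed probesets at a false discovery rate of 10%. The abstract does not say which procedure was used; the sketch below assumes the Benjamini-Hochberg step-up procedure, a common choice, purely for illustration. The function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values, fdr=0.10):
    """Return the sorted indices of tests declared significant under the
    Benjamini-Hochberg step-up procedure at the given FDR level.

    Assumption: BH is used here only as an illustration; the source
    abstract does not name its FDR method."""
    m = len(p_values)
    # Rank the tests from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Step-up rule: keep the largest rank k with p_(k) <= (k / m) * fdr.
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


# Hypothetical p-values for four probesets.
significant = benjamini_hochberg([0.001, 0.04, 0.03, 0.5], fdr=0.10)
```

Because the rule is step-up, every test ranked at or below the cutoff is accepted, even if its own p-value exceeded its individual threshold.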
12.  Systematic interpretation of microarray data using experiment annotations 
BMC Genomics  2006;7:319.
To date, microarray data have mostly been assessed in the context of only one or a few parameters characterizing the experimental conditions under study. More explicit experiment annotations, however, are highly useful for interpreting microarray data when available in a statistically accessible format.
We provide means to preprocess these additional data, and to extract relevant traits corresponding to the transcription patterns under study. We found correspondence analysis particularly well-suited for mapping such extracted traits. It visualizes associations both among and between the traits, the experiments they annotate, and the genes, revealing how they are all interrelated. Here, we apply our methods to the systematic interpretation of radioactive (single channel) and two-channel data, stemming from model organisms such as yeast and Drosophila up to complex human cancer samples. Inclusion of technical parameters allows for identification of artifacts and flaws in experimental design.
Biological and clinical traits can act as landmarks in transcription space, systematically mapping the variance of large datasets from the predominant changes down toward intricate details.
PMCID: PMC1774576  PMID: 17181856
13.  Unsupervised clustering of gene expression data points at hypoxia as possible trigger for metabolic syndrome 
BMC Genomics  2006;7:318.
Classification of large volumes of data produced in a microarray experiment allows for the extraction of important clues as to the nature of a disease.
Using the multi-dimensional unsupervised FOREL (FORmal ELement) algorithm, we re-analyzed three public datasets of skeletal muscle gene expression in connection with insulin resistance and type 2 diabetes (DM2). Our analysis revealed the major line of variation between expression profiles of normal, insulin resistant, and diabetic skeletal muscle. A cluster of the most "metabolically sound" samples occupied one end of this line. The distance along this line coincided with the classic markers of diabetes risk, namely obesity and insulin resistance, but did not follow the accepted clinical diagnosis of DM2 as defined by the presence or absence of hyperglycemia. Genes implicated in this expression pattern are those controlling skeletal muscle fiber type and glycolytic metabolism. Additionally, myoglobin and hemoglobin were upregulated and ribosomal genes deregulated in insulin resistant patients.
Our findings are concordant with the changes seen in skeletal muscle with altitude hypoxia. This suggests that hypoxia and shift to glycolytic metabolism may also drive insulin resistance.
PMCID: PMC1770922  PMID: 17178004
14.  RINGdb: An integrated database for G protein-coupled receptors and regulators of G protein signaling 
BMC Genomics  2006;7:317.
Many marketed therapeutic agents have been developed to modulate the function of G protein-coupled receptors (GPCRs). The regulators of G-protein signaling (RGS proteins) are also being examined as potential drug targets. To facilitate clinical and pharmacological research, we have developed a novel integrated biological database called RINGdb to provide comprehensive and organized RGS protein and GPCR information.
RINGdb contains information on mutations, tissue distributions, protein-protein interactions, diseases/disorders and other features, which has been automatically collected from the Internet and manually extracted from the literature. In addition, RINGdb offers various user-friendly query functions to answer different questions about RGS proteins and GPCRs such as their possible contribution to disease processes, the putative direct or indirect relationship between RGS proteins and GPCRs. RINGdb also integrates organized database cross-references to allow users direct access to detailed information. The database is now available at .
RINGdb is the only integrated database on the Internet to provide comprehensive RGS protein and GPCR information. This knowledgebase will be useful for clinical research, drug discovery and GPCR signaling pathway research.
PMCID: PMC1764023  PMID: 17173697
15.  Systematic identification and integrative analysis of novel genes expressed specifically or predominantly in mouse epididymis 
BMC Genomics  2006;7:314.
Maturation of spermatozoa, including development of motility and the ability to fertilize the oocyte, occurs during transit through the microenvironment of the epididymis. Comprehensive understanding of sperm maturation requires identification and characterization of unique genes expressed in the epididymis.
We systematically identified 32 novel genes with epididymis-specific or -predominant expression in the mouse epididymis UniGene library, containing 1505 gene-oriented transcript clusters, by in silico and in vitro analyses. Northern blot analysis revealed various characteristics of the genes at the transcript level, such as expression level, size and the presence of isoforms. We found that expression of half of the genes is regulated by androgens. Further expression analyses demonstrated that the novel genes are region-specific and developmentally regulated. Computational analysis showed that 15 of the genes lack human orthologues, suggesting their implication in male reproduction unique to the mouse. A number of the novel genes are putative epididymal protease inhibitors or β-defensins. We also found that six of the genes have secretory activity, indicating that they may interact with sperm and have functional roles in sperm maturation.
We identified and characterized 32 novel epididymis-specific or -predominant genes by an integrative approach. Our study is unique in its systematic identification of novel epididymal genes and should provide a firm basis for future investigation into molecular mechanisms underlying sperm maturation in the epididymis.
PMCID: PMC1764739  PMID: 17166261
16.  Directionality of point mutation and 5-methylcytosine deamination rates in the chimpanzee genome 
BMC Genomics  2006;7:316.
The pattern of point mutation is important for studying mutational mechanisms, genome evolution, and diseases. Previous studies of mutation direction were largely based on substitution data from a limited number of loci. To date, there is no genome-wide analysis of mutation direction or methylation-dependent transition rates in the chimpanzee or its categorized genomic regions.
In this study, we performed a detailed examination of mutation direction in the chimpanzee genome and its categorized genomic regions using 588,918 SNPs whose ancestral alleles could be inferred by mapping them to human genome sequences. The C→T (G→A) changes occurred most frequently in the chimpanzee genome. Each type of transition occurred approximately four times more frequently than each type of transversion. Notably, the frequency of C→T (G→A) was the highest in exons among the genomic categories regardless of whether we calculated directly, normalized with the nucleotide content, or removed the SNPs involved in the CpG effect. Moreover, the directionality of point mutation in exons and CpG islands was opposite to that in their corresponding intergenic regions, indicating that different forces govern the nucleotide changes. Our analysis suggests that the GC content is not in equilibrium in the chimpanzee genome. Further quantitative analysis revealed that the 5-methylcytosine deamination rates at CpG sites were highly dependent on the local GC content and the lengths of SNP flanking sequences and varied among categorized genomic regions.
We present the first mutational spectrum, estimated by three different approaches, in the chimpanzee genome. Our results provide detailed information on recent nucleotide changes and methylation-dependent transition rates in the chimpanzee genome after its split from the human. These results have important implications for understanding genome composition evolution, mechanisms of point mutation, and other genetic factors such as selection, biased codon usage, biased gene conversion, and recombination.
PMCID: PMC1764022  PMID: 17166280
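The mutation-direction analysis above distinguishes transitions from transversions and pools complementary changes such as C→T and G→A. A minimal sketch of that bookkeeping follows. It assumes each SNP is given as an (ancestral, derived) pair of bases, and the function names are hypothetical rather than taken from the study.

```python
from collections import Counter

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def classify_substitution(ancestral, derived):
    """Label a directed base change as a transition (purine to purine or
    pyrimidine to pyrimidine) or a transversion (purine to pyrimidine or
    the reverse)."""
    a, d = ancestral.upper(), derived.upper()
    if a not in COMPLEMENT or d not in COMPLEMENT or a == d:
        raise ValueError(f"not a valid substitution: {ancestral}>{derived}")
    same_class = (a in PURINES) == (d in PURINES)
    return "transition" if same_class else "transversion"


def mutation_spectrum(snps):
    """Count directed changes, pooling each change with its
    complementary-strand equivalent (for example C>T with G>A)."""
    counts = Counter()
    for ancestral, derived in snps:
        a, d = ancestral.upper(), derived.upper()
        classify_substitution(a, d)  # validates the pair
        # Use the alphabetically smaller of the two equivalent forms as key.
        key = min((a, d), (COMPLEMENT[a], COMPLEMENT[d]))
        counts[key] += 1
    return counts


# Hypothetical input: two pooled C>T changes and one A>C transversion.
spectrum = mutation_spectrum([("C", "T"), ("G", "A"), ("A", "C")])
```

Pooling by strand leaves six distinct directed classes per pair of equivalent changes, which is why reports such as the one above write C→T (G→A) as a single category.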
17.  High precision multi-genome scale reannotation of enzyme function by EFICAz 
BMC Genomics  2006;7:315.
The functional annotation of most genes in newly sequenced genomes is inferred from similarity to previously characterized sequences, an annotation strategy that often leads to erroneous assignments. We have performed a reannotation of 245 genomes using an updated version of EFICAz, a highly precise method for enzyme function prediction.
Based on our three-field EC number predictions, we have obtained lower-bound estimates for the average enzyme content in Archaea (29%), Bacteria (30%) and Eukarya (18%). Most annotations added in KEGG from 2005 to 2006 agree with EFICAz predictions made in 2005. The coverage of EFICAz predictions is significantly higher than that of KEGG, especially for eukaryotes. Thousands of our novel predictions correspond to hypothetical proteins. We have identified a subset of 64 hypothetical proteins with low sequence identity to EFICAz training enzymes, whose biochemical functions have been recently characterized and find that in 96% (84%) of the cases we correctly identified their three-field (four-field) EC numbers. For two of the 64 hypothetical proteins: PA1167 from Pseudomonas aeruginosa, an alginate lyase (EC and Rv1700 of Mycobacterium tuberculosis H37Rv, an ADP-ribose diphosphatase (EC, we have detected annotation lag of more than two years in databases. Two examples are presented where EFICAz predictions act as hypothesis generators for understanding the functional roles of hypothetical proteins: FLJ11151, a human protein overexpressed in cancer that EFICAz identifies as an endopolyphosphatase (EC, and MW0119, a protein of Staphylococcus aureus strain MW2 that we propose as candidate virulence factor based on its EFICAz predicted activity, sphingomyelin phosphodiesterase (EC
Our results suggest that we have generated enzyme function annotations of high precision and recall. These predictions can be mined and correlated with other information sources to generate biologically significant hypotheses and can be useful for comparative genome analysis and automated metabolic pathway reconstruction.
PMCID: PMC1764738  PMID: 17166279
18.  Large fragment Bst DNA polymerase for whole genome amplification of DNA from formalin-fixed paraffin-embedded tissues 
BMC Genomics  2006;7:312.
Formalin-fixed paraffin-embedded (FFPE) tissues represent the largest source of archival biological material available for genomic studies of human cancer. Therefore, it is desirable to develop methods that enable whole genome amplification (WGA) using DNA extracted from FFPE tissues. Multiple-strand Displacement Amplification (MDA) is an isothermal method for WGA that uses the large fragment of Bst DNA polymerase. To date, MDA has been feasible only for genomic DNA isolated from fresh or snap-frozen tissue, and yields a representational distortion of less than threefold.
We amplified genomic DNA of five FFPE samples of normal human lung tissue with the large fragment of Bst DNA polymerase. Using quantitative PCR, the copy number of 7 genes was evaluated in both amplified and original DNA samples. Four neuroblastoma xenograft samples derived from cell lines with known N-myc gene copy number were also evaluated, as were 7 samples of non-small cell lung cancer (NSCLC) tumors with known Skp2 gene amplification. In addition, we compared the array comparative genomic hybridization (CGH)-based genome profiles of two NSCLC samples before and after Bst MDA. A median 990-fold amplification of DNA was achieved. The DNA amplification products had a very high molecular weight (> 23 Kb). When the gene content of the amplified samples was compared to that of the original samples, the representational distortion was limited to threefold. Array CGH genome profiles of amplified and non-amplified FFPE DNA were similar.
Large fragment Bst DNA polymerase is suitable for WGA of DNA extracted from FFPE tissues, with an expected maximal representational distortion of threefold. Amplified DNA may be used for the detection of gene copy number changes by quantitative real-time PCR and genome profiling by array CGH.
PMCID: PMC1764024  PMID: 17156491
19.  The DNA-damage signature in Saccharomyces cerevisiae is associated with single-strand breaks in DNA 
BMC Genomics  2006;7:313.
Upon exposure to agents that damage DNA, Saccharomyces cerevisiae undergoes widespread reprogramming of gene expression. Such a vast response may be due not only to damage to DNA but also damage to proteins, RNA, and lipids. Here the transcriptional response of S. cerevisiae specifically induced by DNA damage was discerned by exposing S. cerevisiae to a panel of three "radiomimetic" enediyne antibiotics (calicheamicin γ1I, esperamicin A1 and neocarzinostatin) that bind specifically to DNA and generate varying proportions of single- and double-strand DNA breaks. The genome-wide responses were compared to those induced by the non-selective oxidant γ-radiation.
Given well-controlled exposures that resulted in similar and minimal cell death (~20–25%) across all conditions, the extent of gene expression modulation was markedly different depending on treatment with the enediynes or γ-radiation. Exposure to γ-radiation resulted in more extensive transcriptional changes classified both by the number of genes modulated and the magnitude of change. Common biological responses were identified between the enediynes and γ-radiation, with the induction of DNA repair and stress response genes, and the repression of ribosomal biogenesis genes. Despite these common responses, a fraction of the response induced by gamma radiation was repressed by the enediynes and vice versa, suggesting that the enediyne response is not entirely "radiomimetic." Regression analysis identified 55 transcripts with gene expression induction associated both with double- or single-strand break formation. The S. cerevisiae "DNA damage signature" genes as defined by Gasch et al. [1] were enriched among regulated transcripts associated with single-strand breaks, while genes involved in cell cycle regulation were associated with double-strand breaks.
Dissection of the transcriptional response in yeast that is specifically signaled by DNA strand breaks has identified that single-strand breaks provide the signal for activation of transcripts encoding proteins involved in the DNA damage signature in S. cerevisiae, and double-strand breaks signal changes in cell cycle regulation genes.
PMCID: PMC1764021  PMID: 17163986
20.  Deep and comparative analysis of the mycelium and appressorium transcriptomes of Magnaporthe grisea using MPSS, RL-SAGE, and oligoarray methods 
BMC Genomics  2006;7:310.
Rice blast, caused by the fungal pathogen Magnaporthe grisea, is a devastating disease causing tremendous yield loss in rice production. The public availability of the complete genome sequence of M. grisea provides ample opportunities to understand the molecular mechanism of its pathogenesis on rice plants at the transcriptome level. To identify all the expressed genes encoded in the fungal genome, we have analyzed the mycelium and appressorium transcriptomes using massively parallel signature sequencing (MPSS), robust-long serial analysis of gene expression (RL-SAGE) and oligoarray methods.
The MPSS analyses identified 12,531 and 12,927 distinct significant tags from mycelia and appressoria, respectively, while the RL-SAGE analysis identified 16,580 distinct significant tags from the mycelial library. When matching these 12,531 mycelial and 12,927 appressorial significant tags to the annotated CDS, 500 bp upstream and 500 bp downstream of CDS, 6,735 unique genes in mycelia and 7,686 unique genes in appressoria were identified. A total of 7,135 mycelium-specific and 7,531 appressorium-specific significant MPSS tags were identified, which correspond to 2,088 and 1,784 annotated genes, respectively, when matching to the same set of reference sequences. Nearly 85% of the significant MPSS tags from mycelia and appressoria and 65% of the significant tags from the RL-SAGE mycelium library matched to the M. grisea genome. MPSS and RL-SAGE methods supported the expression of more than 9,000 genes, representing over 80% of the predicted genes in M. grisea. About 40% of the MPSS tags and 55% of the RL-SAGE tags represent novel transcripts since they had no matches in the existing M. grisea EST collections. Over 19% of the annotated genes were found to produce both sense and antisense tags in the protein-coding region. The oligoarray analysis identified the expression of 3,793 mycelium-specific and 4,652 appressorium-specific genes. A total of 2,430 mycelial genes and 1,886 appressorial genes were identified by both MPSS and oligoarray.
The comprehensive and deep transcriptome analysis by MPSS and RL-SAGE methods identified many novel sense and antisense transcripts in the M. grisea genome at two important growth stages. The differentially expressed transcripts that were identified, especially those specifically expressed in appressoria, represent a genomic resource useful for gaining a better understanding of the molecular basis of M. grisea pathogenicity. Further analysis of the novel antisense transcripts will provide new insights into the regulation and function of these genes in fungal growth, development and pathogenesis in the host plants.
PMCID: PMC1764740  PMID: 17156450
21.  Compensatory relationship between splice sites and exonic splicing signals depending on the length of vertebrate introns 
BMC Genomics  2006;7:311.
The signals that determine the specificity and efficiency of splicing are multiple and complex, and are not fully understood. Among other factors, the relative contributions of different mechanisms appear to depend on intron size inasmuch as long introns might hinder the activity of the spliceosome through interference with the proper positioning of the intron-exon junctions. Indeed, it has been shown that the information content of splice sites positively correlates with intron length in the nematode, Drosophila, and fungi. We explored the connections between the length of vertebrate introns, the strength of splice sites, exonic splicing signals, and evolution of flanking exons.
A compensatory relationship is shown to exist between different types of signals, namely, the splice sites and the exonic splicing enhancers (ESEs). In the range of relatively short introns (approximately, < 1.5 kilobases in length), the enhancement of the splicing signals for longer introns was manifest in the increased concentration of ESEs. In contrast, for longer introns, this effect was not detectable, and instead, an increase in the strength of the donor and acceptor splice sites was observed. Conceivably, accumulation of A-rich ESE motifs beyond a certain limit is incompatible with functional constraints operating at the level of protein sequence evolution, which leads to compensation in the form of evolution of the splice sites themselves toward greater strength. In addition, however, a correlation between sequence conservation in the exon ends and intron length, particularly, in synonymous positions, was observed throughout the entire length range of introns. Thus, splicing signals other than the currently defined ESEs, i.e., potential new classes of ESEs, might exist in exon sequences, particularly, those that flank long introns.
Several weak but statistically significant correlations were observed between vertebrate intron length, splice site strength, and potential exonic splicing signals. Taken together, these findings attest to a compensatory relationship between splice sites and exonic splicing signals, depending on intron length.
PMCID: PMC1713244  PMID: 17156453
22.  The repertoire of olfactory C family G protein-coupled receptors in zebrafish: candidate chemosensory receptors for amino acids 
BMC Genomics  2006;7:309.
Vertebrate odorant receptors comprise at least three types of G protein-coupled receptors (GPCRs): the OR, V1R, and V2R/V2R-like receptors, the latter group belonging to the C family of GPCRs. These receptor families are thought to receive chemosensory information from a wide spectrum of odorant and pheromonal cues that influence critical animal behaviors such as feeding, reproduction and other social interactions.
Using genome database mining and other informatics approaches, we identified and characterized the repertoire of 54 intact "V2R-like" olfactory C family GPCRs in the zebrafish. Phylogenetic analysis – which also included a set of 34 C family GPCRs from fugu – places the fish olfactory receptors in three major groups, which are related to but clearly distinct from other C family GPCRs, including the calcium sensing receptor, metabotropic glutamate receptors, GABA-B receptor, T1R taste receptors, and the major group of V2R vomeronasal receptor families. Interestingly, an analysis of sequence conservation and selective pressure in the zebrafish receptors revealed the retention of a conserved sequence motif previously shown to be required for ligand binding in other amino acid receptors.
Based on our findings, we propose that the repertoire of zebrafish olfactory C family GPCRs has evolved to allow the detection and discrimination of a spectrum of amino acid and/or amino acid-based compounds, which are potent olfactory cues in fish. Furthermore, as the major groups of fish receptors and mammalian V2R receptors appear to have diverged significantly from a common ancestral gene(s), these receptors likely mediate chemosensation of different classes of chemical structures by their respective organisms.
PMCID: PMC1764893  PMID: 17156446
23.  Evolution of proteomes: fundamental signatures and global trends in amino acid compositions 
BMC Genomics  2006;7:307.
The evolutionary characterization of species and lifestyles at global levels is nowadays a subject of considerable interest, particularly with the availability of many complete genomes. Are there specific properties associated with lifestyles and phylogenies? What are the underlying evolutionary trends? One of the simplest analyses to address such questions concerns characterization of proteomes at the amino acid composition level.
In this work, amino acid compositions of a large set of 208 proteomes, with a significant number of representatives from the three phylogenetic domains and different lifestyles, are analyzed, resorting to an appropriate multidimensional method: correspondence analysis. The analysis reveals striking discrimination between eukaryotes, prokaryotic mesophiles and hyperthermophiles-thermophiles, following amino acid usage. In sharp contrast, no similar discrimination is observed for psychrophiles. The observed distributional properties are compared with various inferred chronologies for the recruitment of amino acids into the genetic code. Such comparisons reveal correlations between the observed segregations of species following amino acid usage, and the separation of amino acids following early or late recruitment.
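The input to an analysis of this kind is a table with one row of amino acid frequencies per proteome. The following is a minimal sketch of how one such row might be computed; the function name `composition` and the two toy sequences are illustrative assumptions, not the authors' code or data, and the correspondence analysis step itself is not shown.

```python
from collections import Counter

# The 20 standard amino acids, in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(proteins):
    """Return the relative frequency of each standard amino acid
    across all sequences in `proteins` (an iterable of strings)."""
    counts = Counter()
    for seq in proteins:
        # Ignore ambiguous or non-standard residue codes such as X or B.
        counts.update(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard amino acids found")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}

# Toy "proteome" of two short made-up sequences (hypothetical data).
toy_proteome = ["MKKL", "MAAK"]
freqs = composition(toy_proteome)
```

Stacking such rows for many proteomes gives the proteome-by-amino-acid table on which a multidimensional method can then be run.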
A simple description of proteomes according to amino acid compositions reveals striking signatures, with sharp segregations or on the contrary non-discriminations following phylogenies and lifestyles. The distribution of species, following amino acid usage, exhibits a discrimination between [high GC]-[high optimal growth temperatures] and [low GC]-[moderate temperatures] characteristics. This discrimination appears to coincide closely with the separation of amino acids following their inferred early or late recruitment into the genetic code. Taken together the various results provide a consistent picture for the evolution of proteomes, in terms of amino acid usage.
PMCID: PMC1764020  PMID: 17147802
24.  Evolutionary anatomies of positions and types of disease-associated and neutral amino acid mutations in the human genome 
BMC Genomics  2006;7:306.
Amino acid mutations in a large number of human proteins are known to be associated with heritable genetic disease. These disease-associated mutations (DAMs) are known to occur predominantly in positions essential to the structure and function of the proteins. Here, we examine how the relative perpetuation and conservation of amino acid positions modulate the genome-wide patterns of 8,627 human DAMs reported in 541 genes. We compare these patterns with 5,308 non-synonymous Single Nucleotide Polymorphisms (nSNPs) in 2,592 genes from primary SNP resources.
The abundance of DAMs shows a negative relationship with the evolutionary rate of the amino acid positions harboring them. An opposite trend describes the distribution of nSNPs. DAMs are also preferentially found in the amino acid positions that are retained (or present) in multiple vertebrate species, whereas the nSNPs are over-abundant in the positions that have been lost (or absent) in the non-human vertebrates. These observations are consistent with the effect of purifying selection on natural variation, which also explains the existence of lower minor nSNP allele frequencies at highly-conserved amino acid positions. The biochemical severity of the inter-specific amino acid changes is also modulated by natural selection, with the fast-evolving positions containing more radical amino acid differences among species. Similarly, DAMs associated with early-onset diseases are more radical than those associated with the late-onset diseases. A small fraction of DAMs (10%) overlap with the amino acid differences between species within the same position, but are biochemically the most conservative group of amino acid differences in our datasets. Overlapping DAMs are found disproportionately in fast-evolving amino acid positions, which, along with the conservative nature of the amino acid changes, may have allowed some of them to escape natural selection until compensatory changes occur.
The consistency and predictability of genome-wide patterns of disease-associated and neutral amino acid variants reported here underscore the importance of the consideration of evolutionary rates of amino acid positions in clinical and population genetic analyses aimed at understanding the nature and fate of disease-associated and neutral population variation. Establishing such general patterns is an early step in efforts to diagnose the pathogenic potentials of novel amino acid mutations.
PMCID: PMC1702542  PMID: 17144929
25.  Ethanol sensitivity: a central role for CREB transcription regulation in the cerebellum 
BMC Genomics  2006;7:308.
Lowered sensitivity to the effects of ethanol increases the risk of developing alcoholism. Inbred mouse strains have been useful for the study of the genetic basis of various drug addiction-related phenotypes. Inbred Long-Sleep (ILS) and Inbred Short-Sleep (ISS) mice differentially express a number of genes thought to be implicated in sensitivity to the effects of ethanol. Concomitantly, there is evidence for a mediating role of cAMP/PKA/CREB signalling in aspects of alcoholism modelled in animals. In this report, the extent to which CREB signalling impacts the differential expression of genes in ILS and ISS mouse cerebella is examined.
A training dataset for Machine Learning (ML) and Exploratory Data Analyses (EDA) was generated from promoter region sequences of a set of genes known to be targets of CREB transcription regulation and a set of genes whose transcription regulations are potentially CREB-independent. For each promoter sequence, a vector of size 132, with elements characterizing nucleotide composition features was generated. Genes whose expressions have been previously determined to be increased in ILS or ISS cerebella were identified, and their CREB regulation status predicted using the ML scheme C4.5. The C4.5 learning scheme was used because, of four ML schemes evaluated, it had the lowest predicted error rate. On an independent evaluation set of 21 genes of known CREB regulation status, C4.5 correctly classified 81% of instances with F-measures of 0.87 and 0.67 respectively for the CREB-regulated and CREB-independent classes. Additionally, six out of eight genes previously determined by two independent microarray platforms to be up-regulated in the ILS or ISS cerebellum were predicted by C4.5 to be transcriptionally regulated by CREB. Furthermore, 64% and 52% of a cross-section of other up-regulated cerebellar genes in ILS and ISS mice, respectively, were deemed to be CREB-regulated.
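To make the reported evaluation figures concrete, the sketch below computes accuracy and per-class F-measures from confusion counts. The abstract does not report the confusion matrix; the counts used here (13 and 4 correct, 2 errors in each direction) are a hypothetical choice that happens to be consistent with 21 genes, 81% accuracy, and F-measures of 0.87 and 0.67.

```python
def f_measure(tp, fp, fn):
    """F-measure for one class: harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts (assumption): not taken from the paper.
tp_reg, fn_reg = 13, 2   # CREB-regulated genes: correctly called, missed
tp_ind, fn_ind = 4, 2    # CREB-independent genes: correctly called, missed

n_genes = tp_reg + fn_reg + tp_ind + fn_ind      # 21 genes in total
accuracy = (tp_reg + tp_ind) / n_genes

# In a two-class problem, one class's misses are the other's false positives.
f_reg = f_measure(tp_reg, fp=fn_ind, fn=fn_reg)
f_ind = f_measure(tp_ind, fp=fn_reg, fn=fn_ind)

print(round(accuracy, 2), round(f_reg, 2), round(f_ind, 2))
```

With two classes, each F-measure depends only on the number of correct calls for that class and the total number of errors, so any split of the four hypothetical errors between the two directions gives the same two values.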
These observations collectively suggest that ethanol sensitivity, as it relates to the cerebellum, may be associated with CREB transcription activity.
PMCID: PMC1698922  PMID: 17147806
