1.  Smed454 dataset: unravelling the transcriptome of Schmidtea mediterranea 
BMC Genomics  2010;11:731.
Background
Freshwater planarians are an attractive model for regeneration and stem cell research and have become a promising tool in the field of regenerative medicine. With the availability of a sequenced planarian genome, the recent application of modern genetic and high-throughput tools has resulted in revitalized interest in these animals, long known for their amazing regenerative capabilities, which enable them to regrow even a new head after decapitation. However, a detailed description of the planarian transcriptome is essential for future investigation into regenerative processes using planarians as a model system.
Results
In order to complement and improve existing gene annotations, we used a 454 pyrosequencing approach to analyze the transcriptome of the planarian species Schmidtea mediterranea. Altogether, 598,435 454-sequencing reads, with an average length of 327 bp, were assembled together with the ~10,000 sequences of the S. mediterranea UniGene set using different similarity cutoffs. The assembly was then mapped onto the current genome data. Remarkably, our Smed454 dataset contains more than 3 million novel transcribed nucleotides sequenced for the first time. A descriptive analysis of planarian splice sites was conducted on those Smed454 contigs that mapped univocally to the current genome assembly. Sequence analysis allowed us to identify genes encoding putative proteins with defined structural properties, such as transmembrane domains. Moreover, we annotated the Smed454 dataset using Gene Ontology, and identified putative homologues of several gene families that may play a key role during regeneration, such as neurotransmitter and hormone receptors, homeobox-containing genes, and genes related to eye function.
Conclusions
We report the first planarian transcript dataset, Smed454, as an open resource tool that can be accessed via a web interface. Smed454 contains significant novel sequence information about most expressed genes of S. mediterranea. Analysis of the annotated data promises to contribute to identification of gene families poorly characterized at a functional level. The Smed454 transcriptome data will assist in the molecular characterization of S. mediterranea as a model organism, which will be useful to a broad scientific community.
doi:10.1186/1471-2164-11-731
PMCID: PMC3022928  PMID: 21194483
2.  The complete genome sequence of Corynebacterium pseudotuberculosis FRC41 isolated from a 12-year-old girl with necrotizing lymphadenitis reveals insights into gene-regulatory networks contributing to virulence 
BMC Genomics  2010;11:728.
Background
Corynebacterium pseudotuberculosis is generally regarded as an important animal pathogen that rarely infects humans. Clinical strains are occasionally recovered from human cases of lymphadenitis, such as C. pseudotuberculosis FRC41 that was isolated from the inguinal lymph node of a 12-year-old girl with necrotizing lymphadenitis. To detect potential virulence factors and corresponding gene-regulatory networks in this human isolate, the genome sequence of C. pseudotuberculosis FRC41 was determined by pyrosequencing and functionally annotated.
Results
Sequencing and assembly of the C. pseudotuberculosis FRC41 genome yielded a circular chromosome with a size of 2,337,913 bp and a mean G+C content of 52.2%. Specific gene sets associated with iron and zinc homeostasis were detected among the 2,110 predicted protein-coding regions and integrated into a gene-regulatory network that is linked with both the central metabolism and the oxidative stress response of FRC41. Two gene clusters encode proteins involved in the sortase-mediated polymerization of adhesive pili that can probably mediate the adherence to host tissue to facilitate additional ligand-receptor interactions and the delivery of virulence factors. The prominent virulence factors phospholipase D (Pld) and corynebacterial protease CP40 are encoded in the genome of this human isolate. The genome annotation revealed additional serine proteases, neuraminidase H, nitric oxide reductase, an invasion-associated protein, and acyl-CoA carboxylase subunits involved in mycolic acid biosynthesis as potential virulence factors. The cAMP-sensing transcription regulator GlxR plays a key role in controlling the expression of several genes contributing to virulence.
Conclusion
The functional data deduced from the genome sequencing and the extended knowledge of virulence factors indicate that the human isolate C. pseudotuberculosis FRC41 is equipped with a distinct gene set promoting its survival under unfavorable environmental conditions encountered in the mammalian host.
doi:10.1186/1471-2164-11-728
PMCID: PMC3022926  PMID: 21192786
3.  SPC-P1: a pathogenicity-associated prophage of Salmonella paratyphi C 
BMC Genomics  2010;11:729.
Background
Salmonella paratyphi C is one of the few human-adapted pathogens along with S. typhi, S. paratyphi A and S. paratyphi B that cause typhoid, but it is not clear whether these bacteria cause the disease by the same or different pathogenic mechanisms. Notably, these typhoid agents have distinct sets of large genomic insertions, which may encode different pathogenicity factors. Previously we identified a novel prophage, SPC-P1, in S. paratyphi C RKS4594 and wondered whether it might be involved in pathogenicity of the bacteria.
Results
We analyzed the sequence of SPC-P1 and found that it is an inducible phage with an overall G+C content of 47.24%, similar to that of most Salmonella phages such as P22 and ST64T but significantly lower than the 52.16% average of the RKS4594 chromosome. Electron microscopy showed short-tailed phage particles very similar to the lambdoid phage CUS-3. To evaluate its roles in pathogenicity, we lysogenized S. paratyphi C strain CN13/87, which did not have this prophage, and infected mice with the lysogenized CN13/87. Compared to the phage-free wild type CN13/87, the lysogenized CN13/87 exhibited significantly increased virulence and caused multi-organ damages in mice at considerably lower infection doses.
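The G+C comparison reported above (47.24% for the prophage versus 52.16% for the chromosome) rests on a simple base-composition calculation. The following is a minimal sketch of that calculation, not the authors' pipeline; the function name `gc_content` and the toy sequence are hypothetical and are not taken from SPC-P1 or RKS4594.

```python
def gc_content(sequence: str) -> float:
    """Return the G+C percentage of a DNA sequence.

    Only unambiguous bases (A, C, G, T) are counted; anything else
    (for example N) is ignored. This convention is an assumption.
    """
    seq = sequence.upper()
    gc = sum(seq.count(base) for base in "GC")
    total = sum(seq.count(base) for base in "ACGT")
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return 100.0 * gc / total


# Hypothetical toy sequence, used only to exercise the function.
toy_sequence = "ATGCATTTAAGC"
print(round(gc_content(toy_sequence), 2))
```

Applied to a full prophage region and to the host chromosome, the same function would yield the two percentages being compared.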
Conclusions
SPC-P1 contributes pathogenicity to S. paratyphi C in animal infection models, so it is possible that this prophage is involved in typhoid pathogenesis in humans. Genetic and functional analyses of SPC-P1 may facilitate the study of pathogenic evolution of the extant typhoid agents, providing particular help in elucidating the pathogenic determinants of the typhoid agents.
doi:10.1186/1471-2164-11-729
PMCID: PMC3022927  PMID: 21192789
4.  Transcriptome analysis of grain-filling caryopses reveals involvement of multiple regulatory pathways in chalky grain formation in rice 
BMC Genomics  2010;11:730.
Background
Grain endosperm chalkiness of rice is a varietal characteristic that negatively affects not only the appearance and milling properties but also the cooking texture and palatability of cooked rice. However, grain chalkiness is a complex quantitative genetic trait and the molecular mechanisms underlying its formation are poorly understood.
Results
A near-isogenic line CSSL50-1 with high chalkiness was compared with its normal parental line Asominori for grain endosperm chalkiness. Physico-biochemical analyses of ripened grains showed that, compared with Asominori, CSSL50-1 contains higher levels of amylose and 8 DP (degree of polymerization) short-chain amylopectin, but lower levels of medium-length 12 DP amylopectin. Transcriptome analysis of 15 DAF (days after flowering) caryopses of the isogenic lines identified 623 differentially expressed genes (P < 0.01), among which 324 genes are up-regulated and 299 down-regulated. These genes were classified into 18 major categories, with 65.3% of them belonging to six major functional groups: signal transduction, cell rescue/defense, transcription, protein degradation, carbohydrate metabolism and redox homeostasis. Detailed pathway dissection demonstrated that genes involved in sucrose and starch synthesis are up-regulated, whereas those involved in non-starch polysaccharides are down-regulated. Several genes involved in oxidoreductive homeostasis were found to have higher expression levels in CSSL50-1 as well, suggesting potential roles of ROS in grain chalkiness formation.
Conclusion
Extensive gene expression changes were detected during rice grain chalkiness formation. Over half of these differentially expressed genes are implicated in several important categories of genes, including signal transduction, transcription, carbohydrate metabolism and redox homeostasis, suggesting that chalkiness formation involves multiple metabolic and regulatory pathways.
doi:10.1186/1471-2164-11-730
PMCID: PMC3023816  PMID: 21192807
5.  Population- and genome-specific patterns of linkage disequilibrium and SNP variation in spring and winter wheat (Triticum aestivum L.) 
BMC Genomics  2010;11:727.
Background
Single nucleotide polymorphisms (SNPs) are ideally suited for the construction of high-resolution genetic maps, studying population evolutionary history and performing genome-wide association mapping experiments. Here, we used a genome-wide set of 1536 SNPs to study linkage disequilibrium (LD) and population structure in a panel of 478 spring and winter wheat cultivars (Triticum aestivum) from 17 populations across the United States and Mexico.
Results
Most of the wheat oligo pool assay (OPA) SNPs that were polymorphic within the complete set of 478 cultivars were also polymorphic in all subpopulations. Higher levels of genetic differentiation were observed among wheat lines within populations than among populations. A total of nine genetically distinct clusters were identified, suggesting that some of the pre-defined populations shared a significant proportion of genetic ancestry. Estimates of population structure (FST) at individual loci showed a high level of heterogeneity across the genome. In addition, seven genomic regions with elevated FST were detected between the spring and winter wheat populations. Some of these regions overlapped with previously mapped flowering time QTL. Across all populations, the highest extent of significant LD was observed in the wheat D-genome, followed by lower LD in the A- and B-genomes. The differences in the extent of LD among populations and genomes were mostly driven by differences in long-range LD (> 10 cM).
Conclusions
Genome- and population-specific patterns of genetic differentiation and LD were discovered in the populations of wheat cultivars from different geographic regions. Our study demonstrated that the estimates of population structure between spring and winter wheat lines can identify genomic regions harboring candidate genes involved in the regulation of growth habit. Variation in LD suggests that breeding and selection had a different impact on each wheat genome both within and among populations. The higher extent of LD in the wheat D-genome versus the A- and B-genomes likely reflects the episodes of recent introgression and population bottleneck accompanying the origin of hexaploid wheat. The assessment of LD and population structure in this assembled panel of diverse lines provides critical information for the development of genetic resources for genome-wide association mapping of agronomically important traits in wheat.
doi:10.1186/1471-2164-11-727
PMCID: PMC3020227  PMID: 21190581
6.  De novo assembly and characterization of root transcriptome using Illumina paired-end sequencing and development of cSSR markers in sweetpotato (Ipomoea batatas) 
BMC Genomics  2010;11:726.
Background
The tuberous root of sweetpotato is an important agricultural and biological organ. Public databases do not contain sufficient transcriptomic and genomic data for understanding the molecular mechanisms underlying tuberous root formation and development. Thus, high-throughput transcriptome sequencing is needed to generate large numbers of transcript sequences from sweetpotato root for gene discovery and molecular marker development.
Results
In this study, more than 59 million sequencing reads were generated using Illumina paired-end sequencing technology. De novo assembly yielded 56,516 unigenes with an average length of 581 bp. Based on sequence similarity search with known proteins, a total of 35,051 (62.02%) genes were identified. Out of these annotated unigenes, 5,046 and 11,983 unigenes were assigned to gene ontology and clusters of orthologous group, respectively. Searching against the Kyoto Encyclopedia of Genes and Genomes Pathway database (KEGG) indicated that 17,598 (31.14%) unigenes were mapped to 124 KEGG pathways, and 11,056 were assigned to metabolic pathways, which were well represented by carbohydrate metabolism and biosynthesis of secondary metabolites. In addition, 4,114 cDNA SSRs (cSSRs) were identified as potential molecular markers in our unigenes. One hundred pairs of PCR primers were designed and used for validation of the amplification and assessment of the polymorphism in genomic DNA pools. The results revealed that 92 primer pairs amplified successfully in initial screening tests.
Conclusion
This study generated a substantial fraction of sweetpotato transcript sequences, which can be used to discover novel genes associated with tuberous root formation and development and will also make it possible to construct high-density microarrays for further characterization of gene expression profiles during these processes. Thousands of cSSR markers identified in the present study can enrich molecular markers and will facilitate marker-assisted selection in sweetpotato breeding. Overall, these sequences and markers will provide valuable resources for the sweetpotato community. Additionally, these results suggest that transcriptome analysis based on Illumina paired-end sequencing is a powerful tool for gene discovery and molecular marker development for non-model species, especially those with large and complex genomes.
doi:10.1186/1471-2164-11-726
PMCID: PMC3016421  PMID: 21182800
7.  Analysis of genomic differences among Clostridium botulinum type A1 strains 
BMC Genomics  2010;11:725.
Background
Type A1 Clostridium botulinum strains are a group of Gram-positive, spore-forming anaerobic bacteria that produce a genetically, biochemically, and biophysically indistinguishable 150 kD protein that causes botulism. The genomes of three type A1 C. botulinum strains have been sequenced and show a high degree of synteny. The purpose of this study was to characterize differences among these genomes and compare these differentiating features with two additional unsequenced strains used in previous studies.
Results
Several strategies were deployed in this report. First, the neurotoxin gene of the University of Massachusetts Dartmouth laboratory Hall strain (UMASS strain) was amplified by PCR and sequenced; its sequence was aligned with the published ATCC 3502 Sanger Institute Hall strain and Allergan Hall strain neurotoxin gene regions. Sequence alignment showed that there was a synonymous single nucleotide polymorphism (SNP) in the region encoding the heavy chain between the Allergan strain and the ATCC 3502 and UMASS strains. Second, comparative genomic hybridization (CGH) demonstrated that the UMASS strain and a strain expected to be derived from ATCC 3502 in the Centers for Disease Control and Prevention (CDC) laboratory (ATCC 3502*) differed in gene content compared to the ATCC 3502 genome sequence published by the Sanger Institute. Third, alignment of the three sequenced C. botulinum type A1 strain genomes revealed the presence of four comparable blocks. Strains ATCC 3502 and ATCC 19397 share the same genome organization, while the organization of the blocks in strain Hall was switched. Lastly, PCR was designed to identify UMASS and ATCC 3502* strain genome organizations. The PCR results indicated that the UMASS strain belonged to the Hall type and the ATCC 3502* strain was identical to the ATCC 3502 (Sanger Institute) type.
Conclusions
Taken together, C. botulinum type A1 strains including Sanger Institute ATCC 3502, ATCC 3502*, ATCC 19397, Hall, Allergan, and UMASS strains demonstrate differences at the level of the neurotoxin gene sequence, in gene content, and in genome arrangement.
doi:10.1186/1471-2164-11-725
PMCID: PMC3038992  PMID: 21182778
8.  Whole genome sequencing of Saccharomyces cerevisiae: from genotype to phenotype for improved metabolic engineering applications 
BMC Genomics  2010;11:723.
Background
Rapid and efficient design and construction of microbial cell factories are made possible by the enabling technology of metabolic engineering, which is now being facilitated by systems biology approaches. Metabolic engineering is often complemented by directed evolution, where selective pressure is applied to a partially genetically engineered strain to confer a desirable phenotype. The exact genetic modification or resulting genotype that leads to the improved phenotype is often not identified or understood well enough to enable further metabolic engineering.
Results
In this work, we performed whole genome high-throughput sequencing and annotation to identify single nucleotide polymorphisms (SNPs) between Saccharomyces cerevisiae strains S288c and CEN.PK113-7D. The yeast strain S288c was the first eukaryote sequenced, serving as the reference genome for the Saccharomyces Genome Database, while CEN.PK113-7D is a preferred laboratory strain for industrial biotechnology research. A total of 13,787 high-quality SNPs were detected between the two strains (reference strain: S288c). Considering only metabolic genes (782 of 5,596 annotated genes), a total of 219 metabolism-specific SNPs are distributed across 158 metabolic genes, with 85 of the SNPs being nonsynonymous (i.e., encoding amino acid modifications). Amongst metabolic SNPs detected, there was pathway enrichment in the galactose uptake pathway (GAL1, GAL10) and ergosterol biosynthetic pathway (ERG8, ERG9). Physiological characterization confirmed a strong deficiency in galactose uptake and metabolism in S288c compared to CEN.PK113-7D, and similarly, ergosterol content in CEN.PK113-7D was significantly higher in both glucose and galactose supplemented cultivations compared to S288c. Furthermore, DNA microarray profiling of S288c and CEN.PK113-7D in both glucose and galactose batch cultures did not provide a clear hypothesis for major phenotypes observed, suggesting that genotype to phenotype correlations are manifested post-transcriptionally or post-translationally either through protein concentration and/or function.
Conclusions
With an intensifying need for microbial cell factories that produce a wide array of target compounds, whole genome high-throughput sequencing and annotation for SNP detection can aid in better reducing and defining the metabolic landscape. This work demonstrates direct correlations between genotype and phenotype that provide clear metabolic engineering targets with a high probability of success. The genome sequence, annotation, and a SNP viewer of CEN.PK113-7D are deposited at http://www.sysbio.se/cenpk.
doi:10.1186/1471-2164-11-723
PMCID: PMC3022925  PMID: 21176163
9.  Accounting for multiple comparisons in a genome-wide association study (GWAS) 
BMC Genomics  2010;11:724.
Background
As we enter an era when testing millions of SNPs in a single gene association study will become the standard, consideration of multiple comparisons is an essential part of determining statistical significance. Bonferroni adjustments can be made but are conservative due to the preponderance of linkage disequilibrium (LD) between genetic markers, and permutation testing is not always a viable option. Three major classes of corrections have been proposed to correct the dependent nature of genetic data in Bonferroni adjustments: permutation testing and related alternatives, principal components analysis (PCA), and analysis of blocks of LD across the genome. We consider seven implementations of these commonly used methods using data from 1514 European American participants genotyped for 700,078 SNPs in a GWAS for AIDS.
Results
A Bonferroni correction using the number of LD blocks found by the three algorithms implemented by Haploview resulted in an insufficiently conservative threshold, corresponding to a genome-wide significance level of α = 0.15 - 0.20. We observed a moderate increase in power when using PRESTO, SLIDE, and simpleℳ when compared with traditional Bonferroni methods for population data genotyped on the Affymetrix 6.0 platform in European Americans (α = 0.05 thresholds between 1 × 10⁻⁷ and 7 × 10⁻⁸).
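The lower end of the reported threshold range can be reproduced with the standard Bonferroni formula, which divides the genome-wide significance level by the number of tests. The sketch below uses the SNP count given in the abstract (700,078); the function names are hypothetical, and the "implied number of tests" is only a back-of-the-envelope inversion of the same formula, not one of the seven methods the study evaluated.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def implied_number_of_tests(alpha: float, threshold: float) -> float:
    """Number of independent tests that would give this per-test threshold."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    return alpha / threshold


n_snps = 700_078  # SNP count stated in the abstract
threshold = bonferroni_threshold(0.05, n_snps)
print(f"{threshold:.2e}")  # 7.14e-08
```

Read the other way, a less stringent threshold of 1 × 10⁻⁷ at α = 0.05 is what a Bonferroni correction would give for roughly 500,000 independent tests, which illustrates how accounting for LD between markers relaxes the threshold.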
Conclusions
Correcting for the number of LD blocks resulted in an anti-conservative Bonferroni adjustment. SLIDE and simpleℳ are particularly useful when using a statistical test not handled in optimized permutation testing packages, and genome-wide corrected p-values using SLIDE are much easier to interpret for consumers of GWAS studies.
doi:10.1186/1471-2164-11-724
PMCID: PMC3023815  PMID: 21176216
10.  Tumor and reproductive traits are linked by RNA metabolism genes in the mouse ovary: a transcriptome-phenotype association analysis 
BMC Genomics  2010;11(Suppl 5):S1.
Background
The link between reproductive life history and incidence of ovarian tumors is well known. Periods of reduced ovulations may confer protection against ovarian cancer. Using phenotypic data available for mouse, a possible association between the ovarian transcriptome, reproductive records and spontaneous ovarian tumor rates was investigated in four mouse inbred strains. NIA15k-DNA microarrays were employed to obtain expression profiles of BalbC, C57BL6, FVB and SWR adult ovaries.
Results
Linear regression analysis with multiple-test control (adjusted p ≤ 0.05) resulted in ovarian tumor frequency (OTF) and number of litters (NL) as the top-correlated among five tested phenotypes. Moreover, nearly one-hundred genes were coincident between these two traits and were decomposed in 76 OTF(–) NL(+) and 20 OTF(+) NL(–) genes, where the plus/minus signs indicate the direction of correlation. Enriched functional categories were RNA-binding/mRNA-processing and protein folding in the OTF(–) NL(+) and the OTF(+) NL(–) subsets, respectively. In contrast, no associations were detected between OTF and litter size (LS), the latter a measure of ovulation events in a single estrous cycle.
Conclusion
Literature text-mining pointed to post-transcriptional control of ovarian processes including oocyte maturation, folliculogenesis and angiogenesis as possible causal relationships of observed tumor and reproductive phenotypes. We speculate that repetitive cycling instead of repetitive ovulations represents the actual link between ovarian tumorigenesis and reproductive records.
doi:10.1186/1471-2164-11-S5-S1
PMCID: PMC3045792  PMID: 21210965
11.  Decreasing the number of false positives in sequence classification 
BMC Genomics  2010;11(Suppl 5):S10.
Background
A large number of probabilistic models used in sequence analysis assign non-zero probability values to most input sequences. The most common way to decide whether a given probability is sufficient is Bayesian binary classification, where the probability of the model characterizing the sequence family of interest is compared to that of an alternative probability model. A null model can serve as this alternative model. This is the scoring technique used by sequence analysis tools such as HMMER, SAM and INFERNAL. The most prevalent null models are position-independent residue distributions that include: the uniform distribution, genomic distribution, family-specific distribution and the target sequence distribution. This paper presents a study to evaluate the impact of the choice of a null model on the final result of classifications. In particular, we are interested in minimizing the number of false predictions in a classification. This is crucial for reducing the costs of biological validation.
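The comparison between a family model and a null model is commonly expressed as a log-odds score, with a positive score favouring the family model. The sketch below illustrates how the choice of null model can flip that decision for a GC-rich candidate. It is a simplified illustration, not the scoring code of HMMER, SAM or INFERNAL: the family model here is a hypothetical position-independent distribution (real profile models are position-specific), and the function names, pseudocount and sequences are assumptions.

```python
import math
from collections import Counter


def log_odds(sequence, model, null):
    """Sum of per-residue log2 ratios between a family model and a null model."""
    return sum(math.log2(model[base] / null[base]) for base in sequence)


def target_null(sequence, alphabet="ACGT", pseudocount=1.0):
    """Null model built from the composition of the target sequence itself.

    The pseudocount (an assumption) keeps unseen residues at non-zero probability.
    """
    counts = Counter(sequence)
    total = len(sequence) + pseudocount * len(alphabet)
    return {base: (counts[base] + pseudocount) / total for base in alphabet}


# Uniform null model: every nucleotide is equally likely.
uniform_null = {base: 0.25 for base in "ACGT"}

# Hypothetical GC-rich family model, position-independent for simplicity.
family_model = {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}

# Hypothetical GC-rich candidate sequence.
candidate = "GCGCGGCCGCGA"

score_uniform = log_odds(candidate, family_model, uniform_null)
score_target = log_odds(candidate, family_model, target_null(candidate))

# A positive score classifies the candidate as a family member.
print(score_uniform > 0, score_target > 0)
```

With the uniform null, the candidate scores positively simply because it is GC-rich; with the target null, the compositional bias is absorbed by the null model and the score drops below zero.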
Results
For all the tests, the target null model presented the lowest number of false positives when using random sequences as a test. The study was performed on DNA sequences using GC content as the measure of content bias, but the results should also be valid for protein sequences. To broaden the application of the results, the study was performed using randomly generated sequences. Previous studies were performed on amino acid sequences, using only one probabilistic model (HMM) and a specific benchmark, and lack more general conclusions about the performance of null models. Finally, a benchmark test with P. falciparum confirmed these results.
Conclusions
Of the evaluated models the best suited for classification are the uniform model and the target model. However, the use of the uniform model presents a GC bias that can cause more false positives for candidate sequences with extreme compositional bias, a characteristic not described in previous studies. In these cases the target model is more dependable for biological validation due to its higher specificity.
doi:10.1186/1471-2164-11-S5-S10
PMCID: PMC3045793  PMID: 21210966
12.  The role of exon shuffling in shaping protein-protein interaction networks 
BMC Genomics  2010;11(Suppl 5):S11.
Background
Physical protein-protein interaction (PPI) is a critical phenomenon for the function of most proteins in living organisms and a significant fraction of PPIs are the result of domain-domain interactions. Exon shuffling, intron-mediated recombination of exons from existing genes, is known to have been a major mechanism of domain shuffling in metazoans. Thus, we hypothesized that exon shuffling could have a significant influence in shaping the topology of PPI networks.
Results
We tested our hypothesis by compiling exon shuffling and PPI data from six eukaryotic species: Homo sapiens, Mus musculus, Drosophila melanogaster, Caenorhabditis elegans, Cryptococcus neoformans and Arabidopsis thaliana. For all four metazoan species, genes enriched in exon shuffling events presented on average higher vertex degree (number of interacting partners) in PPI networks. Furthermore, we verified that a set of protein domains that are simultaneously promiscuous (known to interact with multiple types of other domains), self-interacting (able to interact with another copy of themselves) and abundant in the genomes presents a stronger signal for exon shuffling.
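Vertex degree, as used above, is the number of distinct interaction partners a protein has in the network. The following sketch computes it from an edge list and compares the mean degree of two gene sets. The protein identifiers and the grouping are hypothetical toy data, and excluding self-interactions from the degree count is an assumption rather than the authors' stated convention.

```python
from collections import defaultdict


def vertex_degrees(edges):
    """Map each protein to its number of distinct interaction partners."""
    partners = defaultdict(set)
    for a, b in edges:
        # Touch both keys so that every protein appears in the result.
        partners[a]
        partners[b]
        if a != b:  # assumption: self-interactions do not add to the degree
            partners[a].add(b)
            partners[b].add(a)
    return {protein: len(found) for protein, found in partners.items()}


def mean_degree(degrees, genes):
    """Average degree over a gene set; genes absent from the network count as 0."""
    values = [degrees.get(gene, 0) for gene in genes]
    return sum(values) / len(values)


# Hypothetical toy network.
edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3"), ("P5", "P5")]
degrees = vertex_degrees(edges)

shuffling_enriched = ["P1", "P2"]  # hypothetical gene set
other_genes = ["P3", "P4", "P5"]   # hypothetical gene set
print(mean_degree(degrees, shuffling_enriched), mean_degree(degrees, other_genes))
```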
Conclusions
Exon shuffling appears to have been a recurrent mechanism for the emergence of new PPIs along metazoan evolution. In metazoan genomes, exon shuffling also promoted the expansion of some protein domains. We speculate that their promiscuous and self-interacting properties may have been decisive for that expansion.
doi:10.1186/1471-2164-11-S5-S11
PMCID: PMC3045794  PMID: 21210967
13.  RNA interference-mediated knockdown of CD49e (α5 integrin chain) in human thymic epithelial cells modulates the expression of multiple genes and decreases thymocyte adhesion 
BMC Genomics  2010;11(Suppl 5):S2.
Background
The thymus is a central lymphoid organ, in which bone marrow-derived T cell precursors undergo a complex process of maturation. Developing thymocytes interact with the thymic microenvironment in a defined spatial order. A component of the thymic microenvironment, the thymic epithelial cells, is crucial for the maturation of T-lymphocytes through cell-cell contact, cell matrix interactions and secretion of cytokines/chemokines. There is evidence that extracellular matrix molecules play a fundamental role in guiding differentiating thymocytes in both cortical and medullary regions of the thymic lobules. The interaction between the integrin α5β1 (CD49e/CD29; VLA-5) and fibronectin is relevant for thymocyte adhesion and migration within the thymic tissue. Our previous results have shown that adhesion of thymocytes to a cultured thymic epithelial cell (TEC) line is enhanced in the presence of fibronectin, and can be blocked with anti-VLA-5 antibody.
Results
Herein, we studied the role of CD49e expressed by the human thymic epithelium. For this purpose we knocked down the CD49e by means of RNA interference. This procedure resulted in the modulation of more than 100 genes, some of them coding for other proteins also involved in adhesion of thymocytes; others related to signaling pathways triggered after integrin activation, or even involved in the control of F-actin stress fiber formation. Functionally, we demonstrated that disruption of VLA-5 in human TEC by CD49e-siRNA-induced gene knockdown decreased the ability of TEC to promote thymocyte adhesion. Such a decrease comprised all CD4/CD8-defined thymocyte subsets.
Conclusion
Conceptually, our findings unravel the complexity of gene regulation, as regards key genes involved in the heterocellular cell adhesion between developing thymocytes and the major component of the thymic microenvironment, an interaction that is a mandatory event for proper intrathymic T cell differentiation.
doi:10.1186/1471-2164-11-S5-S2
PMCID: PMC3045795  PMID: 21210968
14.  Disclosing ambiguous gene aliases by automatic literature profiling 
BMC Genomics  2010;11(Suppl 5):S3.
Background
Retrieving pertinent information from biological scientific literature requires cutting-edge text mining methods which may be able to recognize the meaning of the very ambiguous names of biological entities. Aliases of a gene share a common vocabulary in their respective collections of PubMed abstracts. This may be true even when these aliases are not associated with the same subset of documents. This gene-specific vocabulary defines a unique fingerprint that can be used to disclose ambiguous aliases. The present work describes an original method for automatically assessing the ambiguity levels of gene aliases in large gene terminologies based exclusively on the content of their associated literature. The method can deal with the two major problems restricting the usage of current text mining tools: 1) different names associated with the same gene; and 2) one name associated with multiple genes, or even with non-gene entities. Importantly, this method does not require training examples.
Results
Aliases were considered “ambiguous” when their Jaccard distance to the respective official gene symbol was equal to or greater than the smallest distance between the official gene symbol and one of the three internal controls (randomly picked unrelated official gene symbols). Otherwise, they were assigned the status of “synonyms”. We evaluated the coherence of the results by comparing the frequencies of the official gene symbols in the text corpora retrieved with their respective “synonyms” or “ambiguous” aliases. Official gene symbols were mentioned in the abstract collections of 42% (70/165) of their respective synonyms. No official gene symbol occurred in the abstract collections of any of their respective ambiguous aliases. Overall, querying PubMed with official gene symbols and “synonym” aliases allowed a 3.6-fold increase in the number of unique documents retrieved.
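The decision rule described above can be sketched in a few lines. The Jaccard distance between two term sets is one minus the ratio of shared terms to all terms, and an alias is flagged as ambiguous when it is no closer to the official symbol than the nearest of the control symbols. Representing each vocabulary as a plain set of terms is an assumption, since the abstract does not say how the fingerprints are built; the function names and all terms below are hypothetical.

```python
def jaccard_distance(a: set, b: set) -> float:
    """1 - |A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def classify_alias(alias_vocab, official_vocab, control_vocabs):
    """Label an alias as 'ambiguous' or 'synonym' using the rule in the text."""
    # Cutoff: smallest distance between the official symbol and any control.
    cutoff = min(jaccard_distance(official_vocab, c) for c in control_vocabs)
    distance = jaccard_distance(alias_vocab, official_vocab)
    return "ambiguous" if distance >= cutoff else "synonym"


# Hypothetical vocabularies (sets of terms drawn from abstract collections).
official = {"kinase", "phosphorylation", "apoptosis", "tumor"}
controls = [
    {"kinase", "membrane", "transport"},
    {"apoptosis", "tumor", "immune", "cytokine"},
    {"ribosome", "translation"},
]
alias_a = {"kinase", "phosphorylation", "apoptosis", "signaling"}
alias_b = {"river", "sediment", "tumor"}

print(classify_alias(alias_a, official, controls))  # synonym
print(classify_alias(alias_b, official, controls))  # ambiguous
```

Because the cutoff is derived from the controls of each official symbol, the rule needs no training examples, which matches the property the abstract highlights.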
Conclusions
These results confirm that this method is able to distinguish between synonyms and ambiguous gene aliases based exclusively on their vocabulary fingerprint. The approach we describe could be used to enhance the retrieval of relevant literature related to a gene.
doi:10.1186/1471-2164-11-S5-S3
PMCID: PMC3045796  PMID: 21210969
15.  Alternative splicing enriched cDNA libraries identify breast cancer-associated transcripts 
BMC Genomics  2010;11(Suppl 5):S4.
Background
Alternative splicing (AS) is a central mechanism in the generation of genomic complexity and is a major contributor to transcriptome and proteome diversity. Alterations of the splicing process can lead to deregulation of crucial cellular processes and have been associated with a large spectrum of human diseases. Cancer-associated transcripts are potential molecular markers and may contribute to the development of more accurate diagnostic and prognostic methods and also serve as therapeutic targets. Alternative splicing-enriched cDNA libraries have been used to explore the variability generated by alternative splicing. In this study, by combining the use of trapping heteroduplexes and RNA amplification, we developed a powerful approach that enables transcriptome-wide exploration of the AS repertoire for identifying AS variants associated with breast tumor cells modulated by ERBB2 (HER-2/neu) oncogene expression.
Results
The human breast cell line (C5.2) and a pool of 5 ERBB2 over-expressing breast tumor samples were used independently for the construction of two AS-enriched libraries. In total, 2,048 partial cDNA sequences were obtained, revealing 214 alternative splicing sequence-enriched tags (ASSETs). A subset with 79 multiple exon ASSETs was compared to public databases and reported 138 different AS events. A high success rate of RT-PCR validation (94.5%) was obtained, and 2 novel AS events were identified. The influence of ERBB2-mediated expression on AS regulation was evaluated by capillary electrophoresis and probe-ligation approaches in two mammary cell lines (Hb4a and C5.2) expressing different levels of ERBB2. The relative expression balance between AS variants from 3 genes was differentially modulated by ERBB2 in this model system.
Conclusions
In this study, we presented a method for exploring AS from any RNA source in a transcriptome-wide format, which can be easily adapted to next generation sequencers. We identified AS transcripts that were differentially modulated by ERBB2-mediated expression and that can be tested as molecular markers for breast cancer. Such a methodology will be useful for completely deciphering the cancer cell transcriptome diversity resulting from AS and for finding more precise molecular markers.
doi:10.1186/1471-2164-11-S5-S4
PMCID: PMC3045797  PMID: 21210970
16.  How does heparin prevent the pH inactivation of cathepsin B? Allosteric mechanism elucidated by docking and molecular dynamics 
BMC Genomics  2010;11(Suppl 5):S5.
Background
Cathepsin B (catB) is a promising target for anti-cancer drug design due to its implication in several steps of tumorigenesis. catB activity and inhibition are pH-dependent, making it difficult to identify efficient inhibitor candidates for clinical trials. In addition, it is known that heparin binding stabilizes the enzyme in alkaline conditions. However, the molecular mechanism of stabilization is not well understood, indicating the need for more detailed structural and dynamic studies in order to clarify the influence of pH and heparin binding on catB stability.
Results
Our pKa calculations of catB titratable residues revealed distinct protonation states under different pH conditions for six key residues, of which four lie in the crucial interdomain interface. This implies changes in the overall charge distribution at the catB surface, as revealed by calculation of the electrostatic potential. We identified two basic surface regions as possible heparin binding sites, which were confirmed by docking calculations. Molecular dynamics (MD) of both apo catB and catB-heparin complexes were performed using protonation states for catB residues corresponding to the relevant acidic or alkaline conditions. The MD of apo catB at pH 5.5 was very stable, and presented the highest number and occupancy of hydrogen bonds within the inter-domain interface. In contrast, under alkaline conditions the enzyme's overall flexibility was increased: interactions between active site residues were lost, helical content decreased, and domain separation was observed as well as high-amplitude motions of the occluding loop – a main target of drug design studies. Essential dynamics analysis revealed that heparin binding modulates large amplitude motions promoting rearrangement of contacts between catB domains, thus favoring the maintenance of helical content as well as active site stability.
Conclusions
The results of our study contribute to unraveling the molecular events involved in catB inactivation in alkaline pH, highlighting the fact that protonation changes of few residues can alter the overall dynamics of an enzyme. Moreover, we propose an allosteric role for heparin in the regulation of catB stability in such a manner that the restriction of enzyme flexibility would allow the establishment of stronger contacts and thus the maintenance of overall structure.
doi:10.1186/1471-2164-11-S5-S5
PMCID: PMC3045798  PMID: 21210971
17.  Mining flexible-receptor docking experiments to select promising protein receptor snapshots 
BMC Genomics  2010;11(Suppl 5):S6.
Background
Molecular docking simulation is the Rational Drug Design (RDD) step that investigates the affinity between protein receptors and ligands. Typically, molecular docking algorithms consider receptors as rigid bodies. Receptors are, however, intrinsically flexible in the cellular environment. The use of a time series of receptor conformations is an approach to explore its flexibility in molecular docking computer simulations, but it is extensively time-consuming. Hence, selection of the most promising conformations can accelerate docking experiments and, consequently, the RDD efforts.
Results
We previously docked four ligands (NADH, TCL, PIF and ETH) to 3,100 conformations of the InhA receptor from M. tuberculosis. Based on the receptor residue-ligand distances, we preprocessed all docking results to generate appropriate input for data mining. Data preprocessing was done by calculating the shortest interatomic distances between the ligand and the receptor’s residues for each docking result. These distances were the predictive attributes. The target attribute was the estimated free energy of binding (FEB) value calculated by the AutoDock 3.0.5 software. The mining inputs were submitted to the M5P model tree algorithm, which produced short and understandable trees. For NADH, TCL and PIF we obtained correlation values above 95%, while for ETH the correlation was only about 60%. Post processing the generated model trees, for each of their linear models (LMs) we calculated the average FEB of the associated instances. From these values we considered an LM as representative if its average FEB was smaller than or equal to the average FEB of the test set. The instances in the selected LMs were considered the most promising snapshots. This yielded 1,521, 1,780, 2,085 and 902 snapshots for NADH, TCL, PIF and ETH, respectively.
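The selection criterion for linear models can be illustrated with a short Python sketch. It assumes that every docking instance has already been assigned to one linear model of the tree and is available as a (snapshot id, LM id, FEB) tuple; this input layout and the function name are hypothetical, and the sketch does not reproduce the M5P step itself.

```python
from collections import defaultdict


def select_promising_snapshots(instances, test_set_mean_feb):
    """Return snapshot ids belonging to 'representative' linear models.

    instances: iterable of (snapshot_id, lm_id, feb) tuples (assumed layout).
    An LM is representative when the average FEB of its instances is
    smaller than or equal to the average FEB of the test set.
    """
    by_lm = defaultdict(list)
    for snapshot_id, lm_id, feb in instances:
        by_lm[lm_id].append((snapshot_id, feb))

    selected = []
    for members in by_lm.values():
        mean_feb = sum(feb for _, feb in members) / len(members)
        if mean_feb <= test_set_mean_feb:
            # Keep every snapshot covered by this linear model.
            selected.extend(snapshot_id for snapshot_id, _ in members)
    return sorted(selected)
```

Because a more negative FEB indicates stronger estimated binding, keeping LMs whose average is at or below the test-set average retains the groups of conformations with better-than-average predicted affinity.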
Conclusions
By post processing the generated model trees we were able to propose a criterion for selecting linear models which, in turn, is capable of selecting a set of promising receptor conformations. As future work we intend to go further and use these results to elaborate a strategy to preprocess the receptor's 3-D spatial conformation in order to predict FEB values. Besides, we intend to select other compounds, among the million catalogued, that may be promising as new drug candidates for our particular protein receptor target.
doi:10.1186/1471-2164-11-S5-S6
PMCID: PMC3045799  PMID: 21210972
18.  Unraveling the molecular mechanisms of nitrogenase conformational protection against oxygen in diazotrophic bacteria 
BMC Genomics  2010;11(Suppl 5):S7.
Background
G. diazotrophicus and A. vinelandii are aerobic nitrogen-fixing bacteria. Although oxygen is essential for the survival of these organisms, it irreversibly inhibits nitrogenase, the complex responsible for nitrogen fixation. Both microorganisms deal with this paradox through compensatory mechanisms. In A. vinelandii a conformational protection mechanism occurs through the interaction between the nitrogenase complex and the FeSII protein. Previous studies suggested the existence of a similar system in G. diazotrophicus, but the putative protein involved has not yet been described. This study aims to identify the protein coding gene in the recently sequenced genome of G. diazotrophicus and also to provide detailed structural information on nitrogenase conformational protection in both organisms.
Results
Genomic analysis of G. diazotrophicus sequences revealed a protein coding ORF (Gdia0615) enclosing a conserved “fer2” domain, typical of the ferredoxin family and found in A. vinelandii FeSII. Comparative models of both FeSII and Gdia0615 disclosed a conserved beta-grasp fold. Cysteine residues that coordinate the 2[Fe-S] cluster are in conserved positions towards the metallocluster. Analysis of solvent accessible residues and electrostatic surfaces unveiled a hydrophobic dimerization interface. Dimers assembled by molecular docking presented a stable behaviour and a proper accommodation of regions possibly involved in binding of FeSII to nitrogenase throughout molecular dynamics simulations in aqueous solution. Molecular modeling of the nitrogenase complex of G. diazotrophicus was performed and models were compared to the crystal structure of A. vinelandii nitrogenase. Docking experiments of FeSII and Gdia0615 with their corresponding nitrogenase complexes pointed out, in both systems, a putative binding site presenting shape and charge complementarities at the Fe-protein/MoFe-protein complex interface.
Conclusions
The identification of the putative FeSII coding gene in G. diazotrophicus genome represents a large step towards the understanding of the conformational protection mechanism of nitrogenase against oxygen. In addition, this is the first study regarding the structural complementarities of FeSII-nitrogenase interactions in diazotrophic bacteria. The combination of bioinformatic tools for genome analysis, comparative protein modeling, docking calculations and molecular dynamics provided a powerful strategy for the elucidation of molecular mechanisms and structural features of FeSII-nitrogenase interaction.
doi:10.1186/1471-2164-11-S5-S7
PMCID: PMC3045800  PMID: 21210973
19.  SIGLa: an adaptable LIMS for multiple laboratories 
BMC Genomics  2010;11(Suppl 5):S8.
Background
The need to manage large amounts of data is a clear demand for laboratories nowadays. The use of Laboratory Information Management Systems (LIMS) to achieve this is growing each day. A LIMS is a complex computational system used to manage laboratory data with emphasis on quality assurance. Several LIMS are currently available. However, most of them have proprietary code and are sold at a high cost. Moreover, due to their complexity, LIMS are usually designed to comply with the needs of one kind of laboratory, making it very difficult to reuse a LIMS. In this work we describe the Sistema Integrado de Gerência de Laboratórios (SIGLa), an open source LIMS with a new approach designed to allow it to adapt its activities and processes to various types of laboratories.
Results
SIGLa incorporates a workflow management system, making it possible to create and manage customized workflows. For each new laboratory a workflow is defined with its activities, rules and procedures. During execution, for each workflow created, the values of attributes defined in an XPDL file (which describes the workflow) are stored in SIGLa’s database, allowing them to be managed and retrieved upon request. These characteristics increase the system’s flexibility and extend its usability to include the needs of multiple types of laboratories. To construct the main functionalities of SIGLa, a workflow of a proteomic laboratory was first defined. To validate SIGLa’s capability of adapting to multiple laboratories, in this paper we study the process and the needs of a microarray laboratory and define its workflow. This workflow was defined in a period of about two weeks, showing the efficiency and flexibility of the tool.
Conclusions
Using SIGLa it has been possible to construct a microarray LIMS in a few days, illustrating the flexibility and power of the method proposed. With SIGLa’s development we hope to contribute positively to the area of complex data management in laboratories by managing their large amounts of data, guaranteeing the consistency of the data and increasing laboratory productivity. We also hope to make it possible for laboratories with few resources to afford a high-level system for complex data management.
doi:10.1186/1471-2164-11-S5-S8
PMCID: PMC3045801  PMID: 21210974
20.  A machine learning approach for genome-wide prediction of morbid and druggable human genes based on systems-level data 
BMC Genomics  2010;11(Suppl 5):S9.
Background
The genome-wide identification of both morbid genes, i.e., those genes whose mutations cause hereditary human diseases, and druggable genes, i.e., genes coding for proteins whose modulation by small molecules elicits phenotypic effects, requires experimental approaches that are time-consuming and laborious. Thus, a computational approach which could accurately predict such genes on a genome-wide scale would be invaluable for accelerating the pace of discovery of causal relationships between genes and diseases as well as the determination of druggability of gene products.
Results
In this paper we propose a machine learning-based computational approach to predict morbid and druggable genes on a genome-wide scale. For this purpose, we constructed a decision tree-based meta-classifier and trained it on datasets containing, for each morbid and druggable gene, network topological features, tissue expression profile and subcellular localization data as learning attributes. This meta-classifier correctly recovered 65% of known morbid genes with a precision of 66% and correctly recovered 78% of known druggable genes with a precision of 75%. It was then used to assign morbidity and druggability scores to genes not known to be morbid and druggable and we showed a good match between these scores and literature data. Finally, we generated decision trees by training the J48 algorithm on the morbidity and druggability datasets to discover cellular rules for morbidity and druggability and, among the rules, we found that the number of regulating transcription factors and plasma membrane localization are the most important factors for morbidity and druggability, respectively.
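The two performance figures quoted above correspond to recall (the fraction of known genes correctly recovered) and precision (the fraction of predicted genes that are truly positive). A minimal Python sketch of both measures follows; the function names are hypothetical, and the sketch makes no claim about how the authors computed their values.

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Fraction of known positives that the classifier recovers."""
    total = true_positives + false_negatives
    return true_positives / total if total else 0.0


def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of predicted positives that are actually positive."""
    total = true_positives + false_positives
    return true_positives / total if total else 0.0
```

Reporting both values matters because a classifier can raise recall simply by predicting more genes as positive, at the expense of precision.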
Conclusions
We were able to demonstrate that network topological features along with tissue expression profile and subcellular localization can reliably predict human morbid and druggable genes on a genome-wide scale. Moreover, by constructing decision trees based on these data, we could discover cellular rules governing morbidity and druggability.
doi:10.1186/1471-2164-11-S5-S9
PMCID: PMC3045802  PMID: 21210975
21.  Combining modularity, conservation, and interactions of proteins significantly increases precision and coverage of protein function prediction 
BMC Genomics  2010;11:717.
Background
While the number of newly sequenced genomes and genes is constantly increasing, elucidation of their function still is a laborious and time-consuming task. This has led to the development of a wide range of methods for predicting protein functions in silico. We report on a new method that predicts function based on a combination of information about protein interactions, orthology, and the conservation of protein networks in different species.
Results
We show that aggregation of these independent sources of evidence leads to a drastic increase in the number and quality of predictions when compared to baselines and other methods reported in the literature. For instance, our method generates more than 12,000 novel protein functions for human with an estimated precision of ~76%, among which are 7,500 new functional annotations for 1,973 human proteins that previously had zero or only one function annotated. We also verified our predictions on a set of genes that play an important role in colorectal cancer (MLH1, PMS2, EPHB4) and could confirm more than 73% of them based on evidence in the literature.
Conclusions
The combination of different methods into a single, comprehensive prediction method infers thousands of protein functions for every species included in the analysis at varying, yet always high levels of precision and very good coverage.
doi:10.1186/1471-2164-11-717
PMCID: PMC3017542  PMID: 21171995
22.  miRNeye: a microRNA expression atlas of the mouse eye 
BMC Genomics  2010;11:715.
Background
MicroRNAs (miRNAs) are key regulators of biological processes. To define miRNA function in the eye, it is essential to determine a high-resolution profile of their spatial and temporal distribution.
Results
In this report, we present the first comprehensive survey of miRNA expression in ocular tissues, using both microarray and RNA in situ hybridization (ISH) procedures. We initially determined the expression profiles of miRNAs in the retina, lens, cornea and retinal pigment epithelium of the adult mouse eye by microarray. Each tissue exhibited notably distinct miRNA enrichment patterns and cluster analysis identified groups of miRNAs that showed predominant expression in specific ocular tissues or combinations of them. Next, we performed RNA ISH for over 220 miRNAs, including those showing the highest expression levels by microarray, and generated a high-resolution expression atlas of miRNAs in the developing and adult wild-type mouse eye, which is accessible in the form of a publicly available web database. We found that 122 miRNAs displayed restricted expression domains in the eye at different developmental stages, with the majority of them expressed in one or more cell layers of the neural retina.
Conclusions
This analysis revealed miRNAs with differential expression in ocular tissues and provided a detailed atlas of their tissue-specific distribution during development of the murine eye. The combination of the two approaches offers a valuable resource to decipher the contributions of specific miRNAs and miRNA clusters to the development of distinct ocular structures.
doi:10.1186/1471-2164-11-715
PMCID: PMC3018480  PMID: 21171988
23.  The genetic organisation of prokaryotic two-component system signalling pathways 
BMC Genomics  2010;11:720.
Background
Two-component systems (TCSs) are modular and diverse signalling pathways, involving a stimulus-responsive transfer of phosphoryl groups from transmitter to partner receiver domains. TCS gene and domain organisation are both potentially informative regarding biological function, interaction partnerships and molecular mechanisms. However, there is currently little understanding of the relationships between domain architecture, gene organisation and TCS pathway structure.
Results
Here we classify the gene and domain organisation of TCS gene loci from 1405 prokaryotic replicons (>40,000 TCS proteins). We find that 200 bp is the most appropriate distance cut-off for defining whether two TCS genes are functionally linked. More than 90% of all TCS gene loci encode just one or two transmitter and/or receiver domains; however, numerous other geometries exist, often with large numbers of encoded TCS domains. Such information provides insights into the distribution of TCS domains between genes, and within genes. As expected, the organisation of TCS genes and domains is affected by phylogeny, and plasmid-encoded TCSs exhibit differences in organisation from their chromosomally-encoded counterparts.
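A distance cut-off of this kind can be applied by walking along a replicon and merging neighbouring genes into one locus whenever the gap between them is small enough. The Python sketch below is a simplified illustration: it assumes genes are given as (name, start, end) tuples on a single linear replicon, ignores strand and circularity, and treats the 200 bp cut-off as inclusive; all of these choices, and the function name, are assumptions rather than details from the paper.

```python
def group_into_loci(genes, max_gap=200):
    """Group genes into loci by intergenic distance.

    genes: list of (name, start, end) tuples with start <= end (assumed).
    Two neighbouring genes share a locus when the next gene starts no
    more than max_gap bp after the furthest end seen so far in the locus.
    """
    loci = []
    locus_end = None
    for name, start, end in sorted(genes, key=lambda g: g[1]):
        if locus_end is not None and start - locus_end <= max_gap:
            loci[-1].append(name)
            # Track the furthest end so nested or overlapping genes are handled.
            locus_end = max(locus_end, end)
        else:
            loci.append([name])
            locus_end = end
    return loci
```

Loci that end up with a single gene would correspond to the "orphaned" TCS genes that the proposed metrics count.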
Conclusions
We provide here an overview of the genomic and genetic organisation of TCS domains, as a resource for further research. We also propose novel metrics that build upon TCS gene/domain organisation data and allow comparisons between genomic complements of TCSs. In particular, 'percentage orphaned TCS genes' (or 'Dissemination') and 'percentage of complex loci' (or 'Sophistication') appear to be useful discriminators, and to reflect mechanistic aspects of TCS organisation not captured by existing metrics.
doi:10.1186/1471-2164-11-720
PMCID: PMC3018481  PMID: 21172000
24.  New methods for next generation sequencing based microRNA expression profiling 
BMC Genomics  2010;11:716.
Background
MicroRNAs are small non-coding RNA transcripts that regulate post-transcriptional gene expression. The millions of short sequence reads generated by next generation sequencing technologies make this technique particularly suitable for profiling of known and novel microRNAs. A modification to the small-RNA expression kit (SREK, Ambion) library preparation method for the SOLiD sequencing platform is described to generate microRNA sequencing libraries that are compatible with the Illumina Genome Analyzer.
Results
High quality sequencing libraries can successfully be prepared from as little as 100 ng of small RNA-enriched RNA. An easy to use Perl-based analysis pipeline called E-miR was developed to handle the sequencing data in several automated steps including data format conversion, 3' adapter removal, genome alignment and annotation to non-coding RNA transcripts. The sample preparation and E-miR pipeline were used to identify 37 cardiac-enriched microRNAs in stage 16 chicken embryos. Isomir expression profiles between the heart and embryo were highly correlated for all miRNAs, suggesting that tissue or cell specific miRNA modifications do not occur.
Conclusions
In conclusion, our alternative sample preparation method can successfully be applied to generate high quality miRNA sequencing libraries for the Illumina genome analyzer.
doi:10.1186/1471-2164-11-716
PMCID: PMC3022920  PMID: 21171994
25.  Recent transfer of an iron-regulated gene from the plastid to the nuclear genome in an oceanic diatom adapted to chronic iron limitation 
BMC Genomics  2010;11:718.
Background
Although the importance and widespread occurrence of iron limitation in the contemporary ocean is well documented, we still know relatively little about genetic adaptation of phytoplankton to these environments. Compared to its coastal relative Thalassiosira pseudonana, the oceanic diatom Thalassiosira oceanica is highly tolerant to iron limitation. The adaptation to low-iron conditions in T. oceanica has been attributed to a decrease in the photosynthetic components that are rich in iron. Genomic information on T. oceanica may shed light on the genetic basis of the physiological differences between the two species.
Results
The complete 141790 bp sequence of the T. oceanica chloroplast genome [GenBank: GU323224], assembled from massively parallel pyrosequencing (454) shotgun reads, revealed that the petF gene encoding ferredoxin, which is localized in the chloroplast genome in T. pseudonana and other diatoms, has been transferred to the nucleus in T. oceanica. The iron-sulfur protein ferredoxin, a key element of the chloroplast electron transport chain, can be replaced by the iron-free flavodoxin under iron-limited growth conditions, thereby contributing to a reduction in the cellular iron requirements. From a comparison to the genomic context of the T. pseudonana petF gene, the T. oceanica ortholog can be traced back to its chloroplast origin. The coding potential of the T. oceanica chloroplast genome is comparable to that of T. pseudonana and Phaeodactylum tricornutum, though a novel expressed ORF appears in the genomic region that has been subjected to rearrangements linked to the petF gene transfer event.
Conclusions
The transfer of petF from the chloroplast to the nuclear genome in T. oceanica represents a major difference between the two closely related species. The ability of T. oceanica to tolerate iron limitation suggests that the transfer of petF from the chloroplast to the nuclear genome might have contributed to the ecological success of this species.
doi:10.1186/1471-2164-11-718
PMCID: PMC3022921  PMID: 21171997
