1.  The association of Alu repeats with the generation of potential AU-rich elements (ARE) at 3' untranslated regions. 
BMC Genomics  2004;5:97.
A significant portion (about 8% in the human genome) of mammalian mRNA sequences contains AU (Adenine and Uracil) rich elements or AREs at their 3' untranslated regions (UTR). These mRNA sequences are usually stable. However, an increasing number of observations have been made of unstable species, possibly depending on certain elements such as Alu repeats. ARE motifs are repeats of the tetramer AUUU and a monomer A at the end of the repeats ((AUUU)nA). The importance of AREs in biology is that they make certain mRNAs unstable. Proto-oncogenes such as c-fos, c-myc, and c-jun in humans are associated with AREs. Although it has been known that an increased number of ARE motifs decreases the half-life of mRNA containing ARE repeats, the exact mechanism is as yet unknown. We analyzed the occurrences of AREs and Alu and propose a possible mechanism for how human mRNA could acquire and keep AREs at its 3' UTR originating from Alu repeats.
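The (AUUU)nA motif definition above can be illustrated with a short pattern search. This is a minimal sketch, not the authors' method: the function name `find_are_motifs` and the `min_repeats` parameter are hypothetical, only exact matches are reported (the one-mismatch allowance used in the study is not implemented), and DNA input is converted to RNA by replacing T with U.

```python
import re


def find_are_motifs(rna, min_repeats=1):
    """Return (start, matched_text) for each exact (AUUU)nA run with n >= min_repeats.

    Hypothetical helper for illustration only; matches are non-overlapping
    and greedy, so adjacent repeats are reported as one longer motif.
    """
    # Normalise case and accept DNA-style input (T instead of U).
    sequence = rna.upper().replace("T", "U")
    # n or more copies of the AUUU tetramer, followed by a closing A.
    pattern = re.compile(r"(?:AUUU){%d,}A" % min_repeats)
    return [(m.start(), m.group()) for m in pattern.finditer(sequence)]


if __name__ == "__main__":
    # A toy sequence, not taken from any database.
    print(find_are_motifs("GGAUUUAUUUAGG", min_repeats=2))
```

Raising `min_repeats` restricts the search to longer AREs; how those lengths map onto the ARE classes of the study depends on its Table 1, which is not reproduced here.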
Interspersed in the human genome, Alu repeats occupy 5% of the 3' UTR of mRNA sequences. Alu has poly-adenine (poly-A) regions at its end, which lead to poly-thymine (poly-T) regions at the end of its complementary Alu. It has been found that AREs are present at the poly-T regions. From the 3' UTR of the NCBI's reference mRNA sequence database, we found nearly 40% (38.5%) of ARE (Class I) were associated with Alu sequences (Table 1) within one mismatch allowance in ARE sequences. Other ARE classes had statistically significant associations as well. This is far from a random occurrence given their limited quantity. At each ARE class, random distribution was simulated 1,000 times, and it was shown that there is a special relationship between ARE patterns and the Alu repeats.
Defined ARE classes. (Symbol marks are used in this study instead of full sequences.)
AREs are mediating sequence elements affecting the stabilization or degradation of mRNA at the 3' untranslated regions. However, AREs' mechanism and origins are unknown. We report that Alu is a source of ARE. We found that half of the longest AREs were derived from the poly-T regions of the complementary Alu.
PMCID: PMC544599  PMID: 15610565
2.  Evaluation of the chicken transcriptome by SAGE of B cells and the DT40 cell line 
BMC Genomics  2004;5:98.
The understanding of whole genome sequences in higher eukaryotes depends to a large degree on the reliable definition of transcription units including exon/intron structures, translated open reading frames (ORFs) and flanking untranslated regions. The best currently available chicken transcript catalog is the Ensembl build based on the mappings of a relatively small number of full length cDNAs and ESTs to the genome as well as genome sequence derived in silico gene predictions.
We use Long Serial Analysis of Gene Expression (LongSAGE) in bursal lymphocytes and the DT40 cell line to verify the quality and completeness of the annotated transcripts. 53.6% of the more than 38,000 unique SAGE tags (unitags) match to full length bursal cDNAs, the Ensembl transcript build or the genome sequence. The majority of all matching unitags show single matches to the genome, but no matches to the genome derived Ensembl transcript build. Nevertheless, most of these tags map close to the 3' boundaries of annotated Ensembl transcripts.
These results suggest that rather few genes are missing in the current Ensembl chicken transcript build, but that the 3' ends of many transcripts may not have been accurately predicted. The tags with no match in the transcript sequences can now be used to improve gene predictions, pinpoint the genomic location of entirely missed transcripts and optimize the accuracy of gene finder software.
PMCID: PMC543457  PMID: 15610564
3.  Arrays of ultraconserved non-coding regions span the loci of key developmental genes in vertebrate genomes 
BMC Genomics  2004;5:99.
Evolutionarily conserved sequences within or adjoining orthologous genes often serve as critical cis-regulatory regions. Recent studies have identified long, non-coding genomic regions that are perfectly conserved between human and mouse, termed ultra-conserved regions (UCRs). Here, we focus on UCRs that cluster around genes involved in early vertebrate development; genes conserved over 450 million years of vertebrate evolution.
Based on a high resolution detection procedure, our UCR set enables novel insights into vertebrate genome organization and regulation of developmentally important genes. We find that the genomic positions of deeply conserved UCRs are strongly associated with the locations of genes encoding key regulators of development, with particularly strong positional correlation to transcription factor-encoding genes. Of particular importance is the observation that most UCRs are clustered into arrays that span hundreds of kilobases around their presumptive target genes. Such a hallmark signature is present around several uncharacterized human genes predicted to encode developmentally important DNA-binding proteins.
The genomic organization of UCRs, combined with previous findings, suggests that UCRs act as essential long-range modulators of gene expression. The exceptional sequence conservation and clustered structure suggest that UCR-mediated molecular events involve greater complexity than traditional DNA binding by transcription factors. The high-resolution UCR collection presented here provides a wealth of target sequences for future experimental studies to determine the nature of the biochemical mechanisms involved in the preservation of arrays of nearly identical non-coding sequences over the course of vertebrate evolution.
PMCID: PMC544600  PMID: 15613238
4.  FunnyBase: a systems level functional annotation of Fundulus ESTs for the analysis of gene expression 
BMC Genomics  2004;5:96.
While studies of non-model organisms are critical for many research areas, such as evolution, development, and environmental biology, they present particular challenges for both experimental and computational genomic level research. Resources such as mass-produced microarrays and the computational tools linking these data to functional annotation at the system and pathway level are rarely available for non-model species. This type of "systems-level" analysis is critical to the understanding of patterns of gene expression that underlie biological processes.
We describe a bioinformatics pipeline known as FunnyBase that has been used to store, annotate, and analyze 40,363 expressed sequence tags (ESTs) from the heart and liver of the fish, Fundulus heteroclitus. Primary annotations based on sequence similarity are linked to networks of systematic annotation in Gene Ontology (GO) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) and can be queried and computationally utilized in downstream analyses. Steps are taken to ensure that the annotation is self-consistent and that the structure of GO is used to identify higher level functions that may not be annotated directly. An integrated framework for cDNA library production, sequencing, quality control, expression data generation, and systems-level analysis is presented and utilized. In a case study, a set of genes, that had statistically significant regression between gene expression levels and environmental temperature along the Atlantic Coast, shows a statistically significant (P < 0.001) enrichment in genes associated with amine metabolism.
The methods described have application for functional genomics studies, particularly among non-model organisms. The web interface for FunnyBase can be accessed at . Data and source code are available by request at
PMCID: PMC544896  PMID: 15610557
5.  Homopolymer tract length dependent enrichments in functional regions of 27 eukaryotes and their novel dependence on the organism DNA (G+C)% composition 
BMC Genomics  2004;5:95.
DNA homopolymer tracts, poly(dA).poly(dT) and poly(dG).poly(dC), are the simplest of simple sequence repeats. Homopolymer tracts have been systematically examined in the coding, intron and flanking regions of a limited number of eukaryotes. As the number of DNA sequences publicly available increases, the representation (over and under) of homopolymer tracts of different lengths in these regions of different genomes can be compared.
We carried out a survey of the extent of homopolymer tract over-representation (enrichment) and over-proportional length distribution (above expected length) primarily in the single gene documents, but including some whole chromosomes, of 27 eukaryotes across the (G+C)% composition range from 20 – 60%. A total of 5.2 × 10^7 bases from 15,560 cleaned (redundancy removed) sequence documents were analyzed. Calculated frequencies of non-overlapping long homopolymer tracts were found over-represented in non-coding sequences of eukaryotes. Long poly(dA).poly(dT) tracts demonstrated an exponential increase with tract length compared to predicted frequencies. A novel negative slope was observed for all eukaryotes between their (G+C)% composition and the threshold length N where poly(dA).poly(dT) tracts exhibited over-representation, and a corresponding positive slope was observed for poly(dG).poly(dC) tracts. The tract size thresholds at which over-representation of tracts in different eukaryotes began to occur were between 4 – 11 bp depending upon the organism (G+C)% composition. The higher the GC%, the lower the threshold N value was for poly(dA).poly(dT) tracts, meaning that the over-representation happens at relatively lower tract length in more GC-rich surrounding sequence. We also observed a novel relationship between the highest over-representations, as well as lengths of homopolymer tracts in excess of their random occurrence expected maximum lengths.
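Counting non-overlapping homopolymer tracts by length, the first step of a survey like the one described above, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' pipeline: the function name `tract_length_counts` and its parameters are hypothetical, a poly(dA).poly(dT) tract is taken to be a run of A or a run of T on the given strand, and the comparison with expected frequencies is not implemented.

```python
import re
from collections import Counter


def tract_length_counts(dna, bases="AT", min_len=4):
    """Count non-overlapping homopolymer tracts of at least min_len, keyed by length.

    bases="AT" counts poly(dA).poly(dT) tracts (runs of A or runs of T);
    bases="GC" counts poly(dG).poly(dC) tracts. Hypothetical helper.
    """
    # One base from the chosen set, followed by at least (min_len - 1) copies of itself.
    pattern = re.compile(r"([%s])\1{%d,}" % (bases, min_len - 1))
    return Counter(len(m.group()) for m in pattern.finditer(dna.upper()))


if __name__ == "__main__":
    # A toy sequence, not taken from the surveyed data.
    toy = "GAAAAACTTTTGGGGGGC"
    print(tract_length_counts(toy, bases="AT"))
    print(tract_length_counts(toy, bases="GC"))
```

The resulting length histogram would then be compared against whatever null model of expected tract frequencies is chosen; the abstract does not specify that model, so none is assumed here.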
We discuss how our novel tract over-representation observations can be accounted for by a few models. A likely model for poly(dA).poly(dT) tract over-representation involves the known insertion into genomes of DNA synthesized from retroviral mRNAs containing 3' polyA tails. A proposed model that can account for a number of our observed results, concerns the origin of the isochore nature of eukaryotic genomes via a non-equilibrium GC% dependent mutation rate mechanism. Our data also suggest that tract lengthening via slip strand replication is not governed by a simple thermodynamic loop energy model.
PMCID: PMC539357  PMID: 15598342
6.  Prognostic meta-signature of breast cancer developed by two-stage mixture modeling of microarray data 
BMC Genomics  2004;5:94.
An increasing number of studies have profiled tumor specimens using distinct microarray platforms and analysis techniques. With the accumulating amount of microarray data, one of the most intriguing yet challenging tasks is to develop robust statistical models to integrate the findings.
By applying a two-stage Bayesian mixture modeling strategy, we were able to assimilate and analyze four independent microarray studies to derive an inter-study validated "meta-signature" associated with breast cancer prognosis. Combining multiple studies (n = 305 samples) on a common probability scale, we developed a 90-gene meta-signature, which strongly associated with survival in breast cancer patients. Given the set of independent studies using different microarray platforms which included spotted cDNAs, Affymetrix GeneChip, and inkjet oligonucleotides, the individually identified classifiers yielded gene sets predictive of survival in each study cohort. The study-specific gene signatures, however, had minimal overlap with each other, and performed poorly in pairwise cross-validation. The meta-signature, on the other hand, accommodated such heterogeneity and achieved comparable or better prognostic performance when compared with the individual signatures. Further by comparing to a global standardization method, the mixture model based data transformation demonstrated superior properties for data integration and provided solid basis for building classifiers at the second stage. Functional annotation revealed that genes involved in cell cycle and signal transduction activities were over-represented in the meta-signature.
The mixture modeling approach unifies disparate gene expression data on a common probability scale allowing for robust, inter-study validated prognostic signatures to be obtained. With the emerging utility of microarrays for cancer prognosis, it will be important to establish paradigms to meta-analyze disparate gene expression data for prognostic signatures of potential clinical use.
PMCID: PMC544889  PMID: 15598354
7.  GOLD.db: genomics of lipid-associated disorders database 
BMC Genomics  2004;5:93.
The GOLD.db (Genomics of Lipid-Associated Disorders Database) was developed to address the need for integrating disparate information on the function and properties of genes and their products that are particularly relevant to the biology, diagnosis, management, treatment, and prevention of lipid-associated disorders.
The GOLD.db provides a reference for pathways and information about the relevant genes and proteins in an efficiently organized way. The main focus was to provide biological pathways with image maps and visual pathway information for lipid metabolism and obesity-related research. The database also makes it possible to map gene expression data individually to each pathway. Gene expression under different experimental conditions can be viewed sequentially in the context of the pathway. Related large scale gene expression data sets were provided and can be searched for specific genes to integrate information regarding their expression levels in different studies and conditions. Analytic and data mining tools, reagents, protocols, references, and links to relevant genomic resources were included in the database. Finally, the usability of the database was demonstrated using an example about the regulation of Pten mRNA during adipocyte differentiation in the context of relevant pathways.
The GOLD.db will be a valuable tool that allows researchers to efficiently analyze patterns of gene expression and to display them in a variety of useful and informative ways, allowing outside researchers to perform queries pertaining to gene expression results in the context of biological processes and pathways.
PMCID: PMC544894  PMID: 15588328
8.  Polymorphic segmental duplications at 8p23.1 challenge the determination of individual defensin gene repertoires and the assembly of a contiguous human reference sequence 
BMC Genomics  2004;5:92.
Defensins are important components of innate immunity to combat bacterial and viral infections, and can even elicit antitumor responses. Clusters of defensin (DEF) genes are located in a 2 Mb range of the human chromosome 8p23.1. This DEF locus, however, represents one of the regions in the euchromatic part of the final human genome sequence which contains segmental duplications, and recalcitrant gaps indicating high structural dynamics.
We find that inter- and intraindividual genetic variations within this locus prevent a correct automatic assembly of the human reference genome (NCBI Build 34), which currently even contains misassemblies. Manual clone-by-clone alignment and gene annotation as well as repeat and SNP/haplotype analyses result in an alternative alignment significantly improving the DEF locus representation. Our assembly better reflects the experimentally verified variability of DEF gene and DEF cluster copy numbers. It contains an additional DEF cluster which we propose to reside between two already known clusters. Furthermore, manual annotation revealed a novel DEF gene and several pseudogenes expanding the hitherto known DEF repertoire. Analyses of BAC and working draft sequences of the chimpanzee indicate that its DEF region is as complex as in humans and that DEF genes and a cluster are multiplied. Comparative analysis of human and chimpanzee DEF genes identified differences affecting the protein structure. Whether this might contribute to differences in disease susceptibility between man and ape remains to be solved. For the determination of individual DEF gene repertoires we provide a molecular approach based on DEF haplotypes.
Complexity and variability seem to be essential genomic features of the human DEF locus at 8p23.1 and provide an ongoing challenge for the best possible representation in the human reference sequence. Dissection of paralogous sequence variations, duplicon SNPs and multisite variations as well as haplotypes by sequencing based methods is the way for future studies of interindividual DEF locus variability and its disease association.
PMCID: PMC544879  PMID: 15588320
9.  Lower rate of genomic variation identified in the trans-membrane domain of monoamine sub-class of Human G-Protein Coupled Receptors: The Human GPCR-DB Database 
BMC Genomics  2004;5:91.
We have surveyed, compiled and annotated nucleotide variations in 338 human 7-transmembrane receptors (G-protein coupled receptors). In a sample of 32 chromosomes from a Nordic population, we attempted to determine the allele frequencies of 80 non-synonymous SNPs, and found 20 novel polymorphic markers. GPCR receptors of physiological and clinical importance were prioritized for statistical analysis. Natural variation and rare mutation information were merged and presented online in the Human GPCR-DB database .
The average number of SNPs per 1000 bases of exonic sequence was found to be twice the average number of SNPs per Kilobase of intronic regions (2.2 versus 1.0). Of the 338 genes, 111 were single exon genes, that is, were intronless. The average number of exonic-SNPs per single-exon gene was 3.5 (n = 395) while that for multi-exon genes was 0.8 (n = 1176). The average number of variations within the different protein domain (N-terminus, internal- and external-loops, trans-membrane region, C-terminus) indicates a lower rate of variation in the trans-membrane region of Monoamine GPCRs, as compared to Chemokine- and Peptide-receptor sub-classes of GPCRs.
Single-exon GPCRs on average have approximately three times the number of SNPs as compared to GPCRs with introns. Among various functional classes of GPCRs, Monoamine GPCRs have a lower number of natural variations within the trans-membrane domain, indicating evolutionary selection against non-synonymous changes within the membrane-localizing domain of this sub-class of GPCRs.
PMCID: PMC538281  PMID: 15579207
10.  Characterization of the chicken inward rectifier K+ channel IRK1/Kir2.1 gene 
BMC Genomics  2004;5:90.
Inward rectifier potassium channels (IRK) contribute to the normal function of skeletal and cardiac muscle cells. The chick inward rectifier K+ channel cIRK1/Kir2.1 is expressed in skeletal muscle, heart, brain, but not in liver; a distribution similar but not identical to that of mouse Kir2.1. We set out to explore regulatory domains of the cIRK1 promoter that enhance or inhibit expression of the gene in different cell types.
We cloned and characterized the 5'-flanking region of cIRK1. cIRK1 contains two exons with splice sites in the 5'-untranslated region, a structure similar to mouse and human orthologs. cIRK1 has multiple transcription initiation sites, a feature also seen in mouse. However, while the chicken and mouse promoter regions share many regulatory motifs, cIRK1 possesses a GC-richer promoter and a putative TATA box, which appears to positively regulate gene expression. We report here the identification of several candidate cell/tissue specific cIRK1 regulatory domains by comparing promoter activities in expressing (Qm7) and non-expressing (DF1) cells using in vitro transcription assays.
While multiple transcription initiation sites and the combinatorial function of several domains in activating cIRK1 expression are similar to those seen in mKir2.1, the cIRK1 promoter differs by the presence of a putative TATA box. In addition, several domains that regulate the gene's expression differentially in muscle (Qm7) and fibroblast cells (DF1) were identified. These results provide fundamental data to analyze cIRK1 transcriptional mechanisms. The control elements identified here may provide clues to the tissue-specific expression of this K+ channel.
PMCID: PMC538280  PMID: 15569391
11.  The rehydration transcriptome of the desiccation-tolerant bryophyte Tortula ruralis: transcript classification and analysis 
BMC Genomics  2004;5:89.
The cellular response of plants to water-deficits has both economic and evolutionary importance directly affecting plant productivity in agriculture and plant survival in the natural environment. Genes induced by water-deficit stress have been successfully enumerated in plants that are relatively sensitive to cellular dehydration, however we have little knowledge as to the adaptive role of these genes in establishing tolerance to water loss at the cellular level. Our approach to address this problem has been to investigate the genetic responses of plants that are capable of tolerating extremes of dehydration, in particular the desiccation-tolerant bryophyte, Tortula ruralis. To establish a sound basis for characterizing the Tortula genome in regards to desiccation tolerance, we analyzed 10,368 expressed sequence tags (ESTs) from rehydrated rapid-dried Tortula gametophytes, a stage previously determined to exhibit the maximum stress induced change in gene expression.
The 10,368 ESTs formed 5,563 EST clusters (contig groups representing individual genes) of which 3,321 (59.7%) exhibited similarity to genes present in the public databases and 2,242 were categorized as unknowns based on protein homology scores. The 3,321 clusters were classified by function using the Gene Ontology (GO) hierarchy and the KEGG database. The results indicate that the transcriptome contains a diverse population of transcripts that reflects, as expected, a period of metabolic upheaval in the gametophyte cells. Much of the emphasis within the transcriptome is centered on the protein synthetic machinery, ion and metabolite transport, and membrane biosynthesis and repair. Rehydrating gametophytes also have an abundance of transcripts that code for enzymes involved in oxidative stress metabolism and phosphorylating activities. The functional classifications reflect a remarkable consistency with what we have previously established with regards to the metabolic activities that are important in the recovery of the gametophytes from desiccation. A comparison of the GO distribution of Tortula clusters with an identical analysis of 9,981 clusters from the desiccation sensitive bryophyte species Physcomitrella patens, revealed, and accentuated, the differences between stressed and unstressed transcriptomes. Cross species sequence comparisons indicated that on the whole the Tortula clusters were more closely related to those from Physcomitrella than Arabidopsis (complete genome BLASTx comparison) although because of the differences in the databases there were more high scoring matches to the Arabidopsis sequences. The most abundant transcripts contained within the Tortula ESTs encode Late Embryogenesis Abundant (LEA) proteins that are normally associated with drying plant tissues. This suggests that LEAs may also play a role in recovery from desiccation when water is reintroduced into a dried tissue.
The establishment of a rehydration EST collection for Tortula ruralis, an important plant model for plant stress responses and vegetative desiccation tolerance, is an important step in understanding the genome level response to cellular dehydration. The type of transcript analysis performed here has laid the foundation for more detailed functional and genome level analyses of the genes involved in desiccation tolerance in plants.
PMCID: PMC535811  PMID: 15546486
12.  Comparison of frozen and RNALater solid tissue storage methods for use in RNA expression microarrays 
BMC Genomics  2004;5:88.
Primary human tissues are an invaluable widely used tool for discovery of gene expression patterns which characterize disease states. Tissue processing methods remain unstandardized, leading to unanswered concerns of how to best store collected tissues and maintain reproducibility between laboratories. We subdivided uterine myometrial tissue specimens and stored split aliquots using the most common tissue processing methods (fresh, frozen, RNALater) before comparing quantitative RNA expression profiles on the Affymetrix U133 human expression array. Split samples and inclusion of duplicates within each processing group allowed us to undertake a formal genome-wide analysis comparing the magnitude of result variation contributed by sample source (different patients), processing protocol (fresh vs. frozen vs. 24 or 72 hours RNALater), and random background (duplicates). The dataset was randomly permuted to define a baseline pattern of ANOVA test statistic values against which the observed results could be interpreted.
14,639 of 22,283 genes were expressed in at least one sample. Patient subjects provided the greatest sources of variation in the mixed model ANOVA, with replicates and processing method the least. The magnitude of variation conferred by processing method (24 hours RNALater vs 72 hours RNALater vs. fresh vs frozen) was similar to the variability seen within replicates. Subset analysis of the test statistic according to gene functional class showed that the frequency of "outlier" ANOVA results within each functional class is overall no greater than expected by chance.
Ambient storage of tissues for 24 or 72 hours in RNALater did not contribute any systematic shift in quantitative RNA expression results relative to the alternatives of fresh or frozen tissue. This nontoxic preservative enables decentralized tissue collection for expression array analysis without a requirement for specialized equipment.
PMCID: PMC534099  PMID: 15537428
13.  Functional characterization in Caenorhabditis elegans of transmembrane worm-human orthologs 
BMC Genomics  2004;5:85.
The complete genome sequences for human and the nematode Caenorhabditis elegans offer an opportunity to learn more about human gene function through functional characterization of orthologs in the worm. Based on a previous genome-wide analysis of worm-human orthologous transmembrane proteins, we selected seventeen genes to explore experimentally in C. elegans. These genes were selected on the basis that they all have high confidence candidate human orthologs and that their function is unknown. We first analyzed their phylogeny, membrane topology and domain organization. Then gene functions were studied experimentally in the worm by using RNA interference and transcriptional gfp reporter gene fusions.
The experiments gave functional insights for twelve of the genes studied. For example, C36B1.12, the worm ortholog of three presenilin-like genes, was almost exclusively expressed in head neurons, suggesting an ancient conserved role important to neuronal function. We propose a new transmembrane topology for the presenilin-like protein family. sft-4, the worm ortholog of surfeit locus gene Surf-4, proved to be an essential gene required for development during the larval stages of the worm. R155.1, whose human ortholog is entirely uncharacterized, was implicated in body size control and other developmental processes.
By combining bioinformatics and C. elegans experiments on orthologs, we provide functional insights on twelve previously uncharacterized human genes.
PMCID: PMC533873  PMID: 15533247
14.  Genomic structure and cloning of two transcript isoforms of human Sp8 
BMC Genomics  2004;5:86.
The Specificity proteins (Sp) are a family of transcription factors that have three highly conserved zinc-fingers located towards the carboxy-terminal that bind GC-boxes and assist in the initiation of gene transcription. Human Sp1-7 genes have been characterized. Recently, the phenotype of Sp8 null mice has been described, being tailless and having severe truncation of both fore and hind limbs. They also have malformed brains with defective closure of the anterior and posterior neuropore during brain development.
The human Sp8 gene is a three-exon gene that maps to 7p21.3, close to the related Sp4 gene. From an osteosarcoma cell line we cloned two transcript variants that use two different first exons and have a common second exon. One clone encodes a 508-residue protein, Sp8L (isoform 1) and the other a shorter 490-residue protein, Sp8S (isoform 2). These two isoforms are conserved being found also in mice and zebrafish. Analysis of the Sp8L protein sequence reveals an amino-terminal hydrophobic Sp-motif that is disrupted in Sp8S, a buttonhead box and three C2H2 zinc-fingers. Sp8 mRNA expression was detected in a wide range of tissues at a low level, with the highest levels being found in brain. Treatment of the murine pluripotent cell line C3H10T1/2 with 100 ng/mL BMP-2 induced Sp8 mRNA after 24 hours.
There is conservation of the two Sp8 protein isoforms between primates, rodents and fish, suggesting that the isoforms have differing roles in gene regulation. Sp8 may play a role in chondrogenic/osteoblastic differentiation in addition to its role in brain and limb development.
PMCID: PMC534095  PMID: 15533246
15.  Sample size for detecting differentially expressed genes in microarray experiments 
BMC Genomics  2004;5:87.
Microarray experiments are often performed with a small number of biological replicates, resulting in low statistical power for detecting differentially expressed genes and concomitant high false positive rates. While increasing sample size can increase statistical power and decrease error rates, with too many samples, valuable resources are not used efficiently. The issue of how many replicates are required in a typical experimental system needs to be addressed. Of particular interest is the difference in required sample sizes for similar experiments in inbred vs. outbred populations (e.g. mouse and rat vs. human).
We hypothesize that if all other factors (assay protocol, microarray platform, data pre-processing) were equal, fewer individuals would be needed for the same statistical power using inbred animals as opposed to unrelated human subjects, as genetic effects on gene expression will be removed in the inbred populations. We apply the same normalization algorithm and estimate the variance of gene expression for a variety of cDNA data sets (humans, inbred mice and rats) comparing two conditions. Using one sample, paired sample or two independent sample t-tests, we calculate the sample sizes required to detect a 1.5-, 2-, and 4-fold changes in expression level as a function of false positive rate, power and percentage of genes that have a standard deviation below a given percentile.
Factors that affect power and sample size calculations include variability of the population, the desired detectable differences, the power to detect the differences, and an acceptable error rate. In addition, experimental design, technical variability and data pre-processing play a role in the power of the statistical tests in microarrays. We show that the number of samples required for detecting a 2-fold change with 90% probability and a p-value of 0.01 in humans is much larger than the number of samples commonly used in present day studies, and that far fewer individuals are needed for the same statistical power when using inbred animals rather than unrelated human subjects.
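The dependence of sample size on population variability, detectable fold change, power and error rate described above can be illustrated with the standard normal-approximation formula for comparing two independent groups, n = 2 (z_{1-α/2} + z_{power})² σ² / δ², with δ the log2 fold change. This is a rough sketch, not the authors' calculation: they used t-tests on estimated variances, whereas this approximation slightly underestimates n for small samples, and the function name `samples_per_group` and the standard deviations used below are hypothetical.

```python
import math
from statistics import NormalDist


def samples_per_group(fold_change, sd_log2, alpha=0.01, power=0.90):
    """Approximate samples per group for a two-sided, two-sample comparison.

    fold_change: smallest fold change to detect (e.g. 2 for 2-fold).
    sd_log2: assumed per-gene standard deviation of log2 expression
             (a hypothetical input; the abstract reports no values).
    """
    standard_normal = NormalDist()
    delta = math.log2(fold_change)  # effect size on the log2 scale
    z_sum = standard_normal.inv_cdf(1 - alpha / 2) + standard_normal.inv_cdf(power)
    return math.ceil(2 * (z_sum * sd_log2 / delta) ** 2)


if __name__ == "__main__":
    # Compare a low-variability and a high-variability population
    # (the SD values are made up for illustration).
    for sd in (0.5, 1.0):
        print(sd, samples_per_group(2, sd))
```

Because n scales with the square of the standard deviation, halving the variability of expression cuts the required sample size roughly fourfold, which is the qualitative effect the abstract attributes to inbred versus outbred populations.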
PMCID: PMC533874  PMID: 15533245
16.  The use of Open Reading frame ESTs (ORESTES) for analysis of the honey bee transcriptome 
BMC Genomics  2004;5:84.
The ongoing efforts to sequence the honey bee genome require additional initiatives to define its transcriptome. Towards this end, we employed the Open Reading frame ESTs (ORESTES) strategy to generate profiles for the life cycle of Apis mellifera workers.
Of the 5,021 ORESTES, 35.2% matched previously deposited Apis ESTs. The analysis of the remaining sequences defined a set of putative orthologs, most of which had their best-match hits with Anopheles and Drosophila genes. CAP3 assembly of the Apis ORESTES with the 15,500 existing Apis ESTs generated 3,408 contigs. BLASTX comparison of these contigs with protein sets of organisms representing distinct phylogenetic clades revealed a total of 1,629 contigs that Apis mellifera shares with different taxa. Most (41%) represent genes that are common to all taxa, another 21% are shared between metazoans (Bilateria), and 16% are shared only within the Insecta clade. A set of 23 putative genes presented a best match with human genes, many of which encode factors related to cell signaling/signal transduction. The remaining 1,779 contigs (52%) did not match any known sequence. Applying a correction factor deduced from a parallel analysis performed with Drosophila melanogaster ORESTES, we estimate that approximately half of these no-match EST contigs (22%) should represent Apis-specific genes.
The versatile and cost-efficient ORESTES approach produced minilibraries for honey bee life cycle stages. Such information on central gene regions contributes to genome annotation and also lends itself to cross-transcriptome comparisons to reveal evolutionary trends in insect genomes.
PMCID: PMC533872  PMID: 15527499
17.  Cross-species hybridisation of human and bovine orthologous genes on high density cDNA microarrays 
BMC Genomics  2004;5:83.
Cross-species gene-expression comparison is a powerful tool for the discovery of evolutionarily conserved mechanisms and pathways of expression control. The usefulness of cDNA microarrays in this context is that broad areas of homology are compared and hybridization probes are sufficiently large that small inter-species differences in nucleotide sequence would not affect the analytical results. This comparative genomics approach would allow a common set of genes within a specific developmental, metabolic, or disease-related gene pathway to be evaluated in experimental models of human diseases. The objective of this study was to investigate the feasibility and reproducibility of cross-species analysis employing a human cDNA microarray as probe.
As a proof of principle, total RNA derived from human and bovine fetal brains was used as a source of labelled targets for hybridisation onto a human cDNA microarray composed of 349 characterised genes. Each gene was spotted 20 times, giving 6,980 data points and thus enabling highly reproducible spot quantification. High-stringency hybridisation and washing conditions, followed by data analysis, revealed slight differences in the expression levels and reproducibility of the signals between the two species. We also assigned each gene to one of three expression-level categories (high, medium and low). The correlation coefficient of cross-hybridisation between the orthologous genes was 0.94. Verification of the array data by semi-quantitative RT-PCR using common primer sequences enabled co-amplification of both human and bovine transcripts. Finally, we were able to assign gene names to previously uncharacterised bovine ESTs.
Results of our study demonstrate the power of comparative genomics and prove the feasibility of using human microarrays to facilitate the identification of co-expressed orthologous genes in common tissues derived from different species.
PMCID: PMC535340  PMID: 15511299
18.  Microarray and comparative genomics-based identification of genes and gene regulatory regions of the mouse immune system 
BMC Genomics  2004;5:82.
In this study we have built and mined a gene expression database composed of 65 diverse mouse tissues for genes preferentially expressed in immune tissues and cell types. Using expression pattern criteria, we identified 360 genes with preferential expression in thymus, spleen, peripheral blood mononuclear cells, lymph nodes (unstimulated or stimulated), or in vitro activated T-cells.
Gene clusters, formed based on similarity of expression-pattern across either all tissues or the immune tissues only, had highly significant associations both with immunological processes such as chemokine-mediated response, antigen processing, receptor-related signal transduction, and transcriptional regulation, and also with more general processes such as replication and cell cycle control. Within-cluster gene correlations implicated known associations of known genes, as well as immune process-related roles for poorly described genes. To characterize regulatory mechanisms and cis-elements of genes with similar patterns of expression, we used a new version of a comparative genomics-based cis-element analysis tool to identify clusters of cis-elements with compositional similarity among multiple genes. Several clusters contained genes that shared 5–6 cis-elements that included ETS and zinc-finger binding sites. cis-Elements AP2 EGRF ETSF MAZF SP1F ZF5F and AREB ETSF MZF1 PAX5 STAT were shared in a thymus-expressed set; AP4R E2FF EBOX ETSF MAZF SP1F ZF5F and CREB E2FF MAZF PCAT SP1F STAT cis-clusters occurred in activated T-cells; CEBP CREB NFKB SORY and GATA NKXH OCT1 RBIT occurred in stimulated lymph nodes.
This study demonstrates a series of analytic approaches that implicate genes and regulatory elements in the differentiation, maintenance, and function of the immune system. Polymorphism or mutation of these could adversely impact immune system functions.
PMCID: PMC534115  PMID: 15504237
19.  GeneLink: a database to facilitate genetic studies of complex traits 
BMC Genomics  2004;5:81.
In contrast to gene-mapping studies of simple Mendelian disorders, genetic analyses of complex traits are far more challenging, and high quality data management systems are often critical to the success of these projects. To minimize the difficulties inherent in complex trait studies, we have developed GeneLink, a Web-accessible, password-protected Sybase database.
GeneLink is a powerful tool for complex trait mapping, enabling genotypic data to be easily merged with pedigree and extensive phenotypic data. Specifically designed to facilitate large-scale (multi-center) genetic linkage or association studies, GeneLink securely and efficiently handles large amounts of data and provides additional features to facilitate data analysis by existing software packages and quality control. These include the ability to download chromosome-specific data files containing marker data in map order in various formats appropriate for downstream analyses (e.g., GAS and LINKAGE). Furthermore, an unlimited number of phenotypes (either qualitative or quantitative) can be stored and analyzed. Finally, GeneLink generates several quality assurance reports, including genotyping success rates of specified DNA samples or success and heterozygosity rates for specified markers.
GeneLink has already proven an invaluable tool for complex trait mapping studies and is discussed primarily in the context of our large, multi-center study of hereditary prostate cancer (HPC). GeneLink is freely available at .
PMCID: PMC526767  PMID: 15491493
20.  Reconstruction of putative DNA virus from endogenous rice tungro bacilliform virus-like sequences in the rice genome: implications for integration and evolution 
BMC Genomics  2004;5:80.
Plant genomes contain various kinds of repetitive sequences such as transposable elements, microsatellites, tandem repeats and virus-like sequences. Most of them, with the exception of virus-like sequences, do not allow us to trace their origins or to follow the process of their integration into the host genome. Recent discoveries of virus-like sequences in plant genomes led us to set the objective of elucidating the origin of the repetitive sequences. Endogenous rice tungro bacilliform virus (RTBV)-like sequences (ERTBVs) have been found throughout the rice genome. Here, we reconstructed putative virus structures from RTBV-like sequences in the rice genome and characterized them to understand their evolutionary implications, their manner of integration, and the involvement of endogenous virus segments in the corresponding disease response.
We have collected ERTBVs from the rice genomes. They contain rearranged structures and no intact ORFs. The identified ERTBV segments were shown to be phylogenetically divided into three clusters. For each phylogenetic cluster, we were able to make a consensus alignment for a circular virus-like structure carrying two complete ORFs. Comparisons of DNA and amino acid sequences suggested a close relationship between ERTBV and RTBV. The Oryza AA-genome species vary in ERTBV copy number. The species carrying a low copy number of ERTBV segments have been reported to be extremely susceptible to RTBV. The DNA methylation state of the ERTBV sequences was correlated with their copy number in the genome.
These ERTBV segments are unlikely to have functional potential as a virus. However, these sequences made it possible to reconstruct a putative virus, which provided information on virus integration and on the evolutionary relationship with the existing virus. Comparison of ERTBV among the Oryza AA-genome species allowed us to speculate on a possible role of endogenous virus segments against the related disease.
PMCID: PMC526188  PMID: 15488154
21.  Protein kinases of the human malaria parasite Plasmodium falciparum: the kinome of a divergent eukaryote 
BMC Genomics  2004;5:79.
Malaria, caused by the parasitic protist Plasmodium falciparum, represents a major public health problem in the developing world. The P. falciparum genome has been sequenced, which provides new opportunities for the identification of novel drug targets. Eukaryotic protein kinases (ePKs) form a large family of enzymes with crucial roles in most cellular processes; hence malarial ePKs represent potential drug targets. We report an exhaustive analysis of the P. falciparum genomic database (PlasmoDB) aimed at identifying and classifying all ePKs in this organism.
Using a variety of bioinformatics tools, we identified 65 malarial ePK sequences and constructed a phylogenetic tree to position these sequences relative to the seven established ePK groups. Predominant features of the tree were: (i) that several malarial sequences did not cluster within any of the known ePK groups; (ii) that the CMGC group, whose members are usually involved in the control of cell proliferation, had the highest number of malarial ePKs; and (iii) that no malarial ePK clustered with the tyrosine kinase (TyrK) or STE groups, pointing to the absence of three-component MAPK modules in the parasite. A novel family of 20 ePK-related sequences was identified and called FIKK, on the basis of a conserved amino acid motif. The FIKK family seems restricted to Apicomplexa, with 20 members in P. falciparum and just one member in some other Apicomplexan species.
The considerable phylogenetic distance between Apicomplexa and other Eukaryotes is reflected by profound divergences between the kinome of malaria parasites and that of yeast or mammalian cells.
PMCID: PMC526369  PMID: 15479470
22.  i-Genome: A database to summarize oligonucleotide data in genomes 
BMC Genomics  2004;5:78.
Information on the occurrence of sequence features in genomes is crucial to comparative genomics, evolutionary analysis, the analyses of regulatory sequences and the quantitative evaluation of sequences. Computing the frequencies and the occurrences of a pattern in complete genomes is time-consuming.
The proposed database provides information about sequence features generated by exhaustive computation over complete genome sequences. The repetitive elements in the eukaryotic genomes, such as LINEs, SINEs, Alu and LTR, are obtained from Repbase. The database supports various complete genomes including human, yeast, worm, and 128 microbial genomes.
This investigation presents and implements an efficient computational approach to accumulate the occurrences of oligonucleotides or patterns in complete genomes. A database is established to maintain information on the sequence features, including the distribution of oligonucleotides, the gene distribution, the distribution of repetitive elements in genomes and the occurrences of the oligonucleotides. The database provides a more effective and efficient way to access the repetitive features in genomes.
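Counting oligonucleotide occurrences, the kind of computation the database precomputes, can be sketched with a sliding window. This is only an illustration of the general idea, not the i-Genome implementation: the function name is hypothetical, and skipping windows that contain characters other than A, C, G and T is an assumption made here for simplicity.

```python
from collections import Counter


def count_oligos(sequence, k):
    """Count every overlapping oligonucleotide of length k in a sequence.

    Windows containing characters other than A, C, G or T (for example
    the ambiguity code N) are skipped.
    """
    seq = sequence.upper()
    valid = set("ACGT")
    counts = Counter()
    for i in range(len(seq) - k + 1):
        oligo = seq[i:i + k]
        if set(oligo) <= valid:
            counts[oligo] += 1
    return counts


if __name__ == "__main__":
    # Toy sequence, not taken from any genome.
    print(count_oligos("ATATATGC", 2).most_common())
```

For a complete genome this single pass is linear in sequence length, but storing the counts for every k is what makes a precomputed database attractive compared with recomputing on each query.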
PMCID: PMC526275  PMID: 15473908
repeat; genome index; oligonucleotide; database
23.  Integrating linkage and radiation hybrid mapping data for bovine chromosome 15 
BMC Genomics  2004;5:77.
Bovine chromosome (BTA) 15 contains a quantitative trait locus (QTL) for meat tenderness, as well as several breaks in synteny with human chromosome (HSA) 11. Both linkage and radiation hybrid (RH) maps of BTA 15 are available, but the linkage map lacks gene-specific markers needed to identify genes underlying the QTL, and the gene-rich RH map lacks associations with marker genotypes needed to define the QTL. Integrating the maps will provide information to further explore the QTL as well as refine the comparative map between BTA 15 and HSA 11. A recently developed approach to integrating linkage and RH maps uses both linkage and RH data to resolve a consensus marker order, rather than aligning independently constructed maps. Automated map construction procedures employing this maximum-likelihood approach were developed to integrate BTA RH and linkage data, and establish comparative positions of BTA 15 markers with HSA 11 homologs.
The integrated BTA 15 map represents 145 markers: 42 shared by both data sets, 36 unique to the linkage data and 67 unique to the RH data. Sequence alignment yielded comparative positions for 77 bovine markers with homologs on HSA 11. The map covers approximately 32% of the HSA 11 sequence in five segments of conserved synteny; another 15% of HSA 11 is shared with BTA 29. Bovine and human order are consistent in portions of the syntenic segments, but some rearrangement is apparent. Comparative positions of gene markers near the meat tenderness QTL indicate the region includes separate segments of HSA 11. The two microsatellite markers flanking the QTL peak are between defined syntenic segments.
Combining data to construct an integrated map not only consolidates information from different sources onto a single map; the information contributed by each data set also increases the accuracy of the map. Comparison of bovine maps with well-annotated human sequence can provide useful information about genes near mapped bovine markers, but bovine gene order may differ from the human order. Procedures to connect genetic and physical mapping data, build integrated maps for livestock species, and connect those maps to more fully annotated sequence can be automated, facilitating the maintenance of up-to-date maps, and providing a valuable tool to further explore genetic variation in livestock.
PMCID: PMC526187  PMID: 15473903
24.  Evaluation of sense-strand mRNA amplification by comparative quantitative PCR 
BMC Genomics  2004;5:76.
RNA amplification is required for incorporating laser-capture microdissection techniques into microarray assays. However, standard oligonucleotide microarrays contain sense-strand probes, so traditional T7 amplification schemes producing anti-sense RNA are not appropriate for hybridization when combined with conventional reverse transcription labeling methods. We wished to assess the accuracy of a new sense-strand RNA amplification method by comparing ratios between two samples using quantitative real-time PCR (qPCR), mimicking a two-color microarray assay.
We performed our validation using qPCR. Three samples of rat brain RNA and three samples of rat liver RNA were amplified using several kits (Ambion messageAmp, NuGen Ovation, and several versions of Genisphere SenseAmp). Results were assessed by comparing the liver/brain ratio for 192 mRNAs before and after amplification. In general, all kits produced strong correlations with unamplified RNAs. The SenseAmp kit produced the highest correlation, and was also able to amplify a partially degraded sample accurately.
We have validated an optimized sense-strand RNA amplification method for use in comparative studies such as two-color microarrays.
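The before/after comparison described above amounts to correlating liver/brain ratios measured on unamplified and amplified RNA. The sketch below shows one way such a comparison could be computed, assuming a Pearson correlation on log2 ratios; the abstract does not state which correlation measure was used, and all names and numbers here are hypothetical.

```python
import math


def log2_ratios(liver, brain):
    """Per-mRNA log2(liver / brain) from paired abundance estimates."""
    return [math.log2(l / b) for l, b in zip(liver, brain)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Made-up abundances for four mRNAs, before and after amplification.
    before = log2_ratios([8.0, 2.0, 1.0, 4.0], [2.0, 2.0, 4.0, 1.0])
    after = log2_ratios([7.0, 2.2, 1.1, 3.6], [2.0, 2.0, 4.0, 1.0])
    print(round(pearson(before, after), 3))
```

A correlation close to 1 would indicate that amplification preserved the relative liver/brain differences across the measured mRNAs.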
PMCID: PMC524485  PMID: 15469607
25.  Extreme conservation of noncoding DNA near HoxD complex of vertebrates 
BMC Genomics  2004;5:75.
Homeotic gene complexes determine the anterior-posterior body axis in animals. The expression pattern and function of hox genes along this axis are colinear with the order in which they are organized in the complex. This 'chromosomal organization and functional correspondence' is conserved in all bilaterians investigated. Genomic sequences covering the HoxD complex from several vertebrate species are now available. This offers a comparative genomics approach to identify conserved regions linked to this complex. Although the molecular basis of 'colinearity' of Hox complexes is not yet understood, it is possible that there are control elements within or in the proximity of these complexes that establish and maintain the expression patterns of hox genes in a coordinated fashion.
We have compared DNA sequences flanking the HoxD complex of several primate, rodent and fish species. This analysis revealed an unprecedented conservation of non-coding DNA sequences adjacent to the HoxD complex from fish to human. Stretches of hundreds of base pairs in a 7 kb region, upstream of the HoxD complex, show 100% conservation across the vertebrate species. Using PCR primers from the human sequence, these conserved regions could be amplified from other vertebrate species, including other mammals, birds, reptiles, amphibians and fish. Our analysis of these sequences also indicates that starting from the conserved core regions, more sequences have been added on and maintained during evolution from fish to human.
Such a high degree of conservation in the core regions of this 7 kb DNA, where no variation occurred during ~500 million years of evolution, suggests a critical function for these sequences. We suggest that such sequences are likely to provide a molecular handle to gain insight into the evolution and mechanism of regulation of associated gene complexes.
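Stretches of 100% conservation like those reported above can be located in a multiple alignment by scanning for runs of identical, gap-free columns. The sketch below assumes the sequences have already been aligned to equal length with '-' as the gap character; the function name and toy alignment are hypothetical and do not reflect the authors' pipeline.

```python
def conserved_runs(aligned, min_len=1):
    """Return half-open (start, end) column ranges that are identical in
    every aligned sequence and contain no gaps, keeping runs of at least
    min_len columns."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("aligned sequences must have equal length")
    runs = []
    start = None
    for col in range(length):
        column = {s[col].upper() for s in aligned}
        conserved = len(column) == 1 and "-" not in column
        if conserved and start is None:
            start = col                      # a conserved run begins
        elif not conserved and start is not None:
            if col - start >= min_len:
                runs.append((start, col))    # the run ended at col
            start = None
    if start is not None and length - start >= min_len:
        runs.append((start, length))         # run reaches the last column
    return runs


if __name__ == "__main__":
    # Toy alignment of three sequences.
    print(conserved_runs(["ACGTAC", "ACGAAC", "ACG-AC"], min_len=2))
```

Raising `min_len` to a few hundred columns would restrict the output to stretches of the size the abstract describes.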
PMCID: PMC524357  PMID: 15462684
