Results 1-10 (10)

1.  Whole genome sequencing of an ethnic Pathan (Pakhtun) from the north-west of Pakistan 
BMC Genomics  2015;16(1):172.
Pakistan covers a key geographic area in human history, being both part of the Indus River region that acted as one of the cradles of civilization and as a link between Western Eurasia and Eastern Asia. This region is inhabited by a number of distinct ethnic groups, the largest being the Punjabi, Pathan (Pakhtuns), Sindhi, and Baloch.
We analyzed the first ethnic male Pathan genome by sequencing it to 29.7-fold coverage using the Illumina HiSeq2000 platform. A total of 3.8 million single nucleotide variations (SNVs) and 0.5 million small indels were identified by comparing with the human reference genome. Among the SNVs, 129,441 were novel, and 10,315 nonsynonymous SNVs were found in 5,344 genes. SNVs were annotated for health consequences and high risk diseases, as well as possible influences on drug efficacy. We confirmed that the Pathan genome presented here is representative of this ethnic group by comparing it to a panel of Central Asians from the HGDP-CEPH panels typed for ~650 k SNPs. The mtDNA (H2) and Y haplogroup (L1) of this individual were also typical of his geographic region of origin. Finally, we reconstruct the demographic history by PSMC, which highlights a recent increase in effective population size compatible with admixture between European and Asian lineages expected in this geographic region.
We present a whole-genome sequence and analyses of an ethnic Pathan from the north-west province of Pakistan. It is a useful resource to understand genetic variation and human migration across the whole Asian continent.
Electronic supplementary material
The online version of this article (doi:10.1186/s12864-015-1290-1) contains supplementary material, which is available to authorized users.
PMCID: PMC4362645  PMID: 25887915
2.  Genome-wide analysis of DNA methylation patterns in horse 
BMC Genomics  2014;15(1):598.
DNA methylation is an epigenetic regulatory mechanism that plays an essential role in mediating biological processes and determining phenotypic plasticity in organisms. Although the horse reference genome and whole transcriptome data are publicly available, global DNA methylation data for the horse have not yet been reported.
We report the first genome-wide DNA methylation data from skeletal muscle, heart, lung, and cerebrum tissues of thoroughbred horses (TH) and Jeju horses (JH), an indigenous Korean breed, obtained by methyl-DNA immunoprecipitation sequencing. The analysis of the DNA methylation patterns indicated that the average methylation density was lowest in the promoter region and highest in the coding DNA sequence region. Among repeat elements, a relatively high density of methylation was observed in long interspersed nuclear elements compared to short interspersed nuclear elements or long terminal repeat elements. We also identified differentially methylated regions through a comparative analysis of corresponding tissues from TH and JH, indicating that the gene body regions showed a high methylation density.
We report the first DNA methylation landscape and differentially methylated genomic regions (DMRs) of thoroughbred and Jeju horses, providing comprehensive DMR maps of the DNA methylome. These data are an invaluable resource for better understanding epigenetics in the horse and provide information for further analyses of biological function.
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2164-15-598) contains supplementary material, which is available to authorized users.
PMCID: PMC4117963  PMID: 25027854
Thoroughbred horse; Jeju horse; Genome-wide DNA methylation; Differentially methylated region (DMR); MeDIP-seq
3.  Whole transcriptome analyses of six thoroughbred horses before and after exercise using RNA-Seq 
BMC Genomics  2012;13:473.
Thoroughbred horses are the most expensive domestic animals, and their running ability and knowledge about their muscle-related diseases are important in animal genetics. While the horse reference genome is available, there has been no large-scale functional annotation of the genome using expressed genes derived from transcriptomes.
We present a large-scale analysis of whole transcriptome data. We sequenced the whole mRNA from the blood and muscle tissues of six thoroughbred horses before and after exercise. By comparing current genome annotations, we identified 32,361 unigene clusters spanning 51.83 Mb that contained 11,933 (36.87%) annotated genes. More than 60% (20,428) of the unigene clusters did not match any current equine gene model. We also identified 189,973 single nucleotide variations (SNVs) from the sequences aligned against the horse reference genome. Most SNVs (171,558 SNVs; 90.31%) were novel when compared with over 1.1 million equine SNPs from two SNP databases. Using differential expression analysis, we further identified a number of exercise-regulated genes: 62 up-regulated and 80 down-regulated genes in the blood, and 878 up-regulated and 285 down-regulated genes in the muscle. Six of 28 previously-known exercise-related genes were over-expressed in the muscle after exercise. Among the differentially expressed genes, there were 91 transcription factor-encoding genes, which included 56 functionally unknown transcription factor candidates that are probably associated with an early regulatory exercise mechanism. In addition, we found interesting RNA expression patterns where different alternative splicing forms of the same gene showed reversed expressions before and after exercising.
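The percentages quoted in this abstract can be re-derived from the raw counts it gives. The following is a minimal sketch of that arithmetic; the helper name `percent` is hypothetical and is not taken from the article.

```python
def percent(part: int, total: int) -> float:
    """Return part/total as a percentage rounded to two decimals."""
    return round(100 * part / total, 2)

# Counts as quoted in the abstract.
unigene_clusters = 32_361
annotated_clusters = 11_933
unmatched_clusters = 20_428
total_snvs = 189_973
novel_snvs = 171_558

# Annotated and unmatched clusters partition the full set of clusters.
assert annotated_clusters + unmatched_clusters == unigene_clusters

print(percent(annotated_clusters, unigene_clusters))  # 36.87
print(percent(unmatched_clusters, unigene_clusters))  # 63.13 ("more than 60%")
print(percent(novel_snvs, total_snvs))                # 90.31
```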
This study provides the first sequencing-based horse transcriptome data, extensive analysis results, genes differentially expressed before and after exercise, and candidate genes related to exercise.
PMCID: PMC3472166  PMID: 22971240
Transcriptome; Equus caballus; Gene expression; Racing performance
4.  Liverome: a curated database of liver cancer-related gene signatures with self-contained context information 
BMC Genomics  2011;12(Suppl 3):S3.
Hepatocellular carcinoma (HCC) is the fifth most common cancer worldwide. A number of molecular profiling studies have investigated the changes in gene and protein expression that are associated with various clinicopathological characteristics of HCC and generated a wealth of scattered information, usually in the form of gene signature tables. A database of the published HCC gene signatures would be useful to liver cancer researchers seeking to retrieve existing differential expression information on a candidate gene and to make comparisons between signatures for prioritization of common genes. A challenge in constructing such a database is that directly importing the signatures as they appear in articles would lead to loss or ambiguity of the context information that is essential for a correct biological interpretation of a gene’s expression change. This challenge arises because the designation of compared sample groups is most often abbreviated, ad hoc, or even missing from published signature tables. Without manual curation, the context information is lost, leading to uninformative database contents. Although several databases of gene signatures are available, none of them presents signatures in an informative form or offers comprehensive coverage of liver cancer. Thus, we constructed Liverome, a curated database of liver cancer-related gene signatures with self-contained context information.
Liverome’s data coverage is more than three times larger than that of any other signature database, consisting of 143 signatures taken from 98 HCC studies, mostly microarray and proteome, and involving 6,927 genes. The signatures were post-processed into an informative and uniform representation and annotated with an itemized summary so that all context information is unambiguously self-contained within the database. The signatures were further informatively named and meaningfully organized according to ten functional categories for guided browsing. Its web interface enables a straightforward retrieval of known differential expression information on a query gene and a comparison of signatures to prioritize common genes. The utility of Liverome-collected data is shown by case studies in which useful biological insights on HCC are produced.
The Liverome database provides a comprehensive collection of well-curated HCC gene signatures and straightforward interfaces for gene search and signature comparison. Liverome is available at
PMCID: PMC3333186  PMID: 22369201
5.  BioBarcode: a general DNA barcoding database and server platform for Asian biodiversity resources 
BMC Genomics  2009;10(Suppl 3):S8.
DNA barcoding provides a rapid, accurate, and standardized method for species-level identification using short DNA sequences. Such a standardized identification method is useful for mapping all the species on Earth, particularly when DNA sequencing technology is cheaply available. There are many nations in Asia with many biodiversity resources that need to be mapped and registered in databases.
We have built BioBarcode, a general-purpose DNA barcode database and server system, using open-source software: MySQL RDBMS 5.0, BLAST2, and Apache httpd server. An exemplary BioBarcode database has around 11,300 specimen entries (including GenBank data) and registers the biological species to map their genetic relationships. The BioBarcode database contains a chromatogram viewer which improves the performance in DNA sequence analyses.
Asia has a very high degree of biodiversity, and the BioBarcode database server system aims to provide an efficient bioinformatics protocol that can be freely used by Asian researchers and research organizations interested in DNA barcoding. BioBarcode promotes the rapid acquisition of species DNA sequence data that meet global standards by providing specialized services, and it provides tools for the standardization, deposition, management, and analysis of DNA barcode data that will make barcoding cheaper and faster for the biodiversity community. The system can be downloaded upon request, and an exemplary server has been constructed with which to build an Asian biodiversity system.
PMCID: PMC2788395  PMID: 19958506
6.  COMUS: Clinician-Oriented locus-specific MUtation detection and deposition System 
BMC Genomics  2009;10(Suppl 3):S35.
A disease-causing mutation refers to a heritable genetic change that is associated with a specific phenotype (disease). The detection of a mutation from a patient's sample is critical for the diagnosis, treatment, and prognosis of the disease. There are numerous databases and applications with which to archive mutation data. However, none of them have been implemented with any automated bioinformatics tools for mutation detection and analysis starting from raw data materials from patients. We present a Locus Specific mutation DB (LSDB) construction system that supports both mutation detection and deposition in one package.
COMUS (Clinician-Oriented locus specific MUtation detection and deposition System) is a mutation detection and deposition system for developing specific LSDBs. COMUS contains 1) a DNA sequence mutation analysis method for clinicians' mutation data identification and deposition and 2) a curation system for variation detection from clinicians' input data. To embody the COMUS system and to validate its clinical utility, we have chosen the disease hemophilia as a test database. A set of data files from bench experiments and clinical information from hemophilia patients were tested on the LSDB, KoHemGene, which has proven to be a clinician-friendly interface for mutation detection and deposition.
COMUS is a bioinformatics system for detecting and depositing new mutations from patient DNA with a clinician-friendly interface. LSDBs made using COMUS will promote the clinical utility of LSDBs. COMUS is available at
PMCID: PMC2788389  PMID: 19958500
7.  PDbase: a database of Parkinson's Disease-related genes and genetic variation using substantia nigra ESTs 
BMC Genomics  2009;10(Suppl 3):S32.
Parkinson's disease (PD) is one of the most common neurodegenerative disorders, clinically characterized by impaired motor function. Since the etiology of PD is diverse and complex, many researchers have created PD-related research resources. However, resources for brain and PD studies are still lacking. Therefore, we have constructed a database of PD-related gene and genetic variations using the substantia nigra (SN) in PD and normal tissues. In addition, we integrated PD-related information from several resources.
We collected 6,130 SN expressed sequence tags (ESTs) from normal brain SN tissues and from SN tissues of PD patients, using the full-cDNA library and normalized cDNA library construction methods from our previous study. The SN ESTs were clustered into 2,951 unigene clusters and assigned to 2,678 genes. We then found 57 up-regulated and 48 down-regulated genes by comparing EST frequencies between normal and PD SN tissues, using a cut-off probability of differential expression above 0.9 based on the Audic and Claverie method. In addition, we integrated disease-related information from public resources. To examine the characteristics of these PD-related genes, we analyzed alternative splicing events, single nucleotide polymorphism (SNP) markers located in the gene regions, repeat elements, gene regulation elements, and pathways and protein-protein interaction networks.
We constructed the PDbase database to capture PD-related genes, genetic variations, and functional elements. This database contains 2,698 PD-related genes, identified through ESTs from human normal and PD patient SN tissues and through the integration of several public resources. PDbase provides mitochondrial proteins, microRNA gene regulation elements, single nucleotide polymorphism (SNP) markers within PD-related gene structures, repeat elements, and pathways and networks with protein-protein interaction information. The PDbase information can aid in understanding the causation of PD. It is available at Supplementary data is available at
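The Audic and Claverie method named above compares tag counts for one gene between two libraries. A minimal sketch of its core conditional probability follows, assuming the standard published form p(y|x) = (N2/N1)^y (x+y)! / (x! y! (1+N2/N1)^(x+y+1)), where x and y are the EST counts of a gene in the two libraries and N1 and N2 are the library sizes. The function name and the example numbers are hypothetical, and the abstract does not say how the 0.9 cut-off was derived from this distribution, so no thresholding is shown.

```python
from math import exp, lgamma, log

def audic_claverie_pmf(x: int, y: int, n1: int, n2: int) -> float:
    """Probability of seeing y tags in library 2 given x tags in library 1.

    n1 and n2 are the total tag counts of the two libraries. Log-gamma
    is used instead of factorials so that large counts do not overflow.
    """
    ratio = n2 / n1
    log_p = (
        y * log(ratio)
        + lgamma(x + y + 1) - lgamma(x + 1) - lgamma(y + 1)
        - (x + y + 1) * log(1 + ratio)
    )
    return exp(log_p)

# Hypothetical example: 3 ESTs in a 3,000-EST library, 12 in a 3,130-EST library.
p = audic_claverie_pmf(3, 12, 3000, 3130)
print(f"p(y=12 | x=3) = {p:.4g}")
```

For equal library sizes and x = 0 the formula reduces to 1/2^(y+1), which gives a quick sanity check on any implementation.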
PMCID: PMC2788386  PMID: 19958497
8.  MitoInteractome: Mitochondrial protein interactome database, and its application in 'aging network' analysis 
BMC Genomics  2009;10(Suppl 3):S20.
Mitochondria play a vital role in the energy production and apoptotic process of eukaryotic cells. Proteins in the mitochondria are encoded by nuclear and mitochondrial genes. Owing to a large increase in the number of identified mitochondrial protein sequences and completed mitochondrial genomes, it has become necessary to provide a web-based database of mitochondrial protein information.
We present 'MitoInteractome', a consolidated web-based portal containing a wealth of information on predicted protein-protein interactions, physico-chemical properties, polymorphism, and diseases related to the mitochondrial proteome. MitoInteractome contains 6,549 protein sequences which were extracted from the following databases: SwissProt, MitoP, MitoProteome, HPRD and Gene Ontology database. The first general mitochondrial interactome has been constructed based on the concept of 'homologous interaction' using PSIMAP (Protein Structural Interactome MAP) and PEIMAP (Protein Experimental Interactome MAP). Using the above mentioned methods, protein-protein interactions were predicted for 74 species. The mitochondrial protein interaction data of humans was used to construct a network for the aging process. Analysis of the 'aging network' gave us vital insights into the interactions among proteins that influence the aging process.
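The idea of 'homologous interaction' can be illustrated with a toy example: if two protein families are known to interact, an interaction is predicted between any two proteins assigned to those families. This is only a sketch of the general concept under that assumption, not the PSIMAP or PEIMAP implementation; the function name, protein identifiers, and family labels are all hypothetical.

```python
from itertools import combinations

def predict_homologous_interactions(protein_to_family, family_interactions):
    """Predict protein pairs whose families are known to interact.

    protein_to_family: dict mapping protein ID to family label.
    family_interactions: iterable of (family, family) pairs with known
    interactions; pairs are treated as unordered.
    """
    known = {frozenset(pair) for pair in family_interactions}
    predicted = set()
    for a, b in combinations(sorted(protein_to_family), 2):
        families = frozenset((protein_to_family[a], protein_to_family[b]))
        if families in known:
            predicted.add((a, b))
    return predicted

# Toy data: P1 and P4 share a family that is known to interact with P2's family.
toy_families = {"P1": "F1", "P2": "F2", "P3": "F3", "P4": "F1"}
toy_known = [("F1", "F2")]
print(sorted(predict_homologous_interactions(toy_families, toy_known)))
# [('P1', 'P2'), ('P2', 'P4')]
```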
MitoInteractome is a comprehensive database that would (1) aid in increasing our understanding of the molecular functions and interaction networks of mitochondrial proteins, (2) help in identifying new target proteins for experimental research using predicted protein-protein interaction information, and (3) help in identifying biomarkers for diagnosis and new molecular targets for drug development related to mitochondria. MitoInteractome is available at
PMCID: PMC2788373  PMID: 19958484
9.  PutidaNET: Interactome database service and network analysis of Pseudomonas putida KT2440 
BMC Genomics  2009;10(Suppl 3):S18.
Pseudomonas putida KT2440 (P. putida KT2440) is a highly versatile saprophytic soil bacterium. It is a certified bio-safety host for transferring foreign genes. Therefore, the bacterium is used as a model organism for genetic and physiological studies and for the development of biotechnological applications. In order to provide a more systematic application of the organism, we have constructed a protein-protein interaction (PPI) network analysis system of P. putida KT2440.
PutidaNET is a comprehensive interaction database and server for P. putida KT2440, generated from three protein-protein interaction (PPI) prediction methods: PSIMAP (Protein Structural Interactome MAP), PEIMAP (Protein Experimental Interactome MAP), and domain-domain interactions using iPfam. PutidaNET contains 3,254 proteins and 82,019 possible interactions, consisting of 61,011 (PSIMAP), 4,293 (PEIMAP), and 30,043 (iPfam) interaction pairs, excluding self-interactions. We also performed a case study by integrating a protein interaction network with experimental 1-DE/MS-MS analysis data for P. putida. We found that 1) major functional modules are involved in various metabolic pathways and ribosomes, and 2) existing PPI sub-networks that are specific to succinate or benzoate metabolism are not in the center as predicted.
We introduce PutidaNET, which provides predicted interaction partners and functional analyses such as physicochemical properties, KEGG pathway assignment, and Gene Ontology mapping for P. putida KT2440. PutidaNET is freely available at
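The three per-method counts sum to 95,347, which is 13,328 more than the 82,019 total. One plausible reading, which is an assumption and not stated in the abstract, is that some pairs are predicted by more than one method and are counted once in the total. The sketch below shows that kind of merge on toy data; the protein names and the helper `normalize` are hypothetical.

```python
def normalize(pairs):
    """Return unordered protein pairs as a set, dropping self-interactions."""
    return {frozenset(pair) for pair in pairs if pair[0] != pair[1]}

# Toy predictions from three methods (not real PutidaNET data).
psimap = normalize([("A", "B"), ("B", "C"), ("C", "C")])
peimap = normalize([("B", "A"), ("D", "E")])
ipfam = normalize([("B", "C"), ("E", "F")])

merged = psimap | peimap | ipfam
per_method_total = len(psimap) + len(peimap) + len(ipfam)

print(per_method_total)                # 6 pair predictions across methods
print(len(merged))                     # 4 distinct pairs after merging
print(per_method_total - len(merged))  # 2 predictions duplicated across methods

# The same arithmetic applied to the figures quoted in the abstract.
reported_excess = (61_011 + 4_293 + 30_043) - 82_019
print(reported_excess)                 # 13328
```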
PMCID: PMC2788370  PMID: 19958481
10.  MitoVariome: a variome database of human mitochondrial DNA 
BMC Genomics  2009;10(Suppl 3):S12.
Mitochondrial sequence variation provides critical information for studying human evolution and variation. Mitochondrial DNA provides information on the origin of humans and plays a substantial role in forensics, degenerative diseases, cancers, and the aging process. Typically, human mitochondrial DNA has various features such as HVSI, HVSII, single-nucleotide polymorphisms (SNPs), restriction enzyme sites, and short tandem repeats (STRs).
We present a variome database (MitoVariome) of human mitochondrial DNA sequences. Queries against MitoVariome can be made using accession numbers or haplogroup/continent. Query results are presented both as text and as HTML tables that report extensive mitochondrial sequence variation information. The variation information includes repeat patterns, restriction enzyme site polymorphisms, short tandem repeats, and disease information, as well as single nucleotide polymorphisms. It also provides a Gbrowse graphical interface that displays all variations at a glance. The web interface further provides a tool for assigning haplogroups, based on a haplogroup-diagnostic system with a complete list of human mitochondrial SNP positions, and a tool for retrieving sequences by accession number.
MitoVariome is a freely accessible web application and database that enables human mitochondrial genome researchers to study genetic variation in the mitochondrial genome through textual and graphical views, and it assigns haplogroups to data that users submit. Because it contains many kinds of variation features in the human mitochondrial genome, MitoVariome will be useful for understanding mitochondrial variation by individual, haplogroup, or geographical location and for elucidating the history of human evolution.
PMCID: PMC2788364  PMID: 19958475