1.  Clustering by neurocognition for fine-mapping of the schizophrenia susceptibility loci on chromosome 6p 
Genes, brain, and behavior  2009;8(8):785-794.
Chromosome 6p is one of the most commonly implicated regions in the genome-wide linkage scans of schizophrenia, whereas further association studies for markers in this region were inconsistent likely due to heterogeneity. This study aimed to identify more homogeneous subgroups of families for fine mapping on regions around markers D6S296 and D6S309 (both in 6p24.3) as well as D6S274 (in 6p22.3) by means of similarity in neurocognitive functioning. A total of 160 families of patients with schizophrenia comprising at least two affected siblings who had data for 8 neurocognitive test variables of the Continuous Performance Test (CPT) and the Wisconsin Card Sorting Test (WCST) were subjected to cluster analysis with data visualization using the test scores of both affected siblings. Family clusters derived were then used separately in family-based association tests for 64 single nucleotide polymorphisms covering the region of 6p24.3 and 6p22.3. Three clusters were derived from the family-based clustering, with deficit cluster 1 representing deficit on the CPT, deficit cluster 2 representing deficit on both the CPT and the WCST, and a third cluster of non-deficit. After adjustment using false discovery rate for multiple testing, SNP rs13873 and haplotype rs1225934-rs13873 on BMP6-TXNDC5 genes were significantly associated with schizophrenia for the deficit cluster 1 but not for the deficit cluster 2 or non-deficit cluster. Our results provide further evidence that the BMP6-TXNDC5 locus on 6p24.3 may play a role in the selective impairments on sustained attention of schizophrenia.
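The false discovery rate adjustment mentioned in this abstract can be illustrated with a small sketch. The abstract does not say which FDR procedure was used; the code below assumes the Benjamini-Hochberg step-up procedure, and the function name `benjamini_hochberg` and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order.

    Assumption: the abstract only says "false discovery rate"; BH is used
    here as one common choice, not as the authors' confirmed method.
    """
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical per-SNP p-values from a family-based association test.
    snp_p_values = [0.01, 0.04, 0.03, 0.005]
    print(benjamini_hochberg(snp_p_values))
```

A SNP would then be called significant within a cluster when its adjusted p-value falls below the chosen FDR level.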
PMCID: PMC4286260  PMID: 19694819
endophenotype; sustained attention deficit; executive dysfunction; candidate gene; cluster analysis; schizophrenia
2.  Increased gene expression of FOXP1 in patients with autism spectrum disorders 
Molecular Autism  2013;4:23.
Comparative gene expression profiling analysis is useful in discovering differentially expressed genes associated with various diseases, including mental disorders. Autism spectrum disorders (ASD) are a group of complex childhood-onset neurodevelopmental and genetic disorders characterized by deficits in language development and verbal communication, impaired reciprocal social interaction, and the presence of repetitive behaviors or restricted interests. The study aimed to identify novel genes associated with the pathogenesis of ASD.
We conducted comparative total gene expression profiling analysis of lymphoblastoid cell lines (LCL) between 16 male patients with ASD and 16 male control subjects to screen differentially expressed genes associated with ASD. We verified one of the differentially expressed genes, FOXP1, using real-time quantitative PCR (RT-qPCR) in a sample of 83 male patients and 83 male controls that included the initial 16 male patients and male controls, respectively.
A total of 252 differentially expressed probe sets representing 202 genes were detected between the two groups, including 89 up- and 113 downregulated genes in the ASD group. RT-qPCR verified significant elevation of the FOXP1 gene transcript of LCL in a sample of 83 male patients (10.46 ± 11.34) compared with 83 male controls (5.17 ± 8.20, P = 0.001).
Comparative gene expression profiling analysis of LCL is useful in discovering novel genetic markers associated with ASD. Elevated gene expression of FOXP1 might contribute to the pathogenesis of ASD.
Clinical trial registration
Identifier: NCT00494754
PMCID: PMC3723673  PMID: 23815876
Autism; FOXP1; Expression microarray; Genetics; Lymphoblastoid cell line
3.  The DAO Gene Is Associated with Schizophrenia and Interacts with Other Genes in the Taiwan Han Chinese Population 
PLoS ONE  2013;8(3):e60099.
Schizophrenia is a highly heritable disease with a polygenic mode of inheritance. Many studies have contributed to our understanding of the genetic underpinnings of schizophrenia, but little is known about how interactions among genes affect the risk of schizophrenia. This study aimed to assess the associations and interactions among genes that confer vulnerability to schizophrenia and to examine the moderating effect of neuropsychological impairment.
We analyzed 99 SNPs from 10 candidate genes in 1,512 subject samples. The permutation-based single-locus, multi-locus association tests, and a gene-based multifactorial dimension reduction procedure were used to examine genetic associations and interactions to schizophrenia.
We found that no single SNP was significantly associated with schizophrenia. However, a risk haplotype, namely A-T-C of the SNP triplet rsDAO7-rsDAO8-rsDAO13 of the DAO gene, was strongly associated with schizophrenia. Interaction analyses identified multiple between-gene and within-gene interactions. Between-gene interactions including DAO*DISC1, DAO*NRG1 and DAO*RASD2 and a within-gene interaction for CACNG2 were found among schizophrenia subjects with severe sustained attention deficits, suggesting a modifying effect of impaired neuropsychological functioning. Other interactions such as the within-gene interaction of DAO and the between-gene interaction of DAO and PTK2B were consistently identified regardless of stratification by neuropsychological dysfunction. Importantly, except for the within-gene interaction of CACNG2, all of the identified risk haplotypes and interactions involved SNPs from DAO.
These results suggest that DAO, which is involved in N-methyl-d-aspartate receptor regulation, signaling and glutamate metabolism, is the master gene of the genetic associations and interactions underlying schizophrenia. In addition, the interaction between DAO and RASD2 provides insight into integrating the glutamate and dopamine hypotheses of schizophrenia.
PMCID: PMC3610748  PMID: 23555897
4.  Morus alba and active compound oxyresveratrol exert anti-inflammatory activity via inhibition of leukocyte migration involving MEK/ERK signaling 
Morus alba has long been used in traditional Chinese medicine to treat inflammatory diseases; however, the scientific basis for such usage and the mechanism of action are not well understood. This study investigated the action of M. alba on leukocyte migration, one key step in inflammation.
Gas chromatography-mass spectrometry (GC-MS) and cluster analyses of supercritical CO2 extracts of three Morus species were performed for chemotaxonomy-aided plant authentication. Phytochemistry and CXCR4-mediated chemotaxis assays were used to characterize the chemical and biological properties of M. alba and its active compound, oxyresveratrol. Fluorescence-activated cell sorting (FACS) and Western blot analyses were conducted to determine the mode of action of oxyresveratrol.
Chemotaxonomy was used to help authenticate M. alba. Chemotaxis-based isolation identified oxyresveratrol as an active component in M. alba. Phytochemical and chemotaxis assays showed that the crude extract, ethyl acetate fraction and oxyresveratrol from M. alba suppressed cell migration of Jurkat T cells in response to SDF-1. Mechanistic study indicated that oxyresveratrol diminished CXCR4-mediated T-cell migration via inhibition of the MEK/ERK signaling cascade.
A combination of GC-MS and cluster analysis techniques is applicable for authentication of the Morus species. Anti-inflammatory benefits of M. alba and its active compound, oxyresveratrol, may involve the inhibition of CXCR4-mediated chemotaxis and the MEK/ERK pathway in T and other immune cells.
PMCID: PMC3639811  PMID: 23433072
Chemotaxis; CXCR4; Morus; Phytochemistry and T-cells
5.  Recursive Feature Selection with Significant Variables of Support Vectors 
The development of DNA microarrays allows researchers to screen thousands of genes simultaneously and helps identify genes with high and low expression levels in normal and diseased tissues. Selecting relevant genes for cancer classification is an important issue. Most gene selection methods use univariate ranking criteria and arbitrarily choose a threshold for selecting genes. However, the parameter setting may not be compatible with the selected classification algorithms. In this paper, we propose a new gene selection method (SVM-t) based on the use of t-statistics embedded in a support vector machine. We compared its performance with that of two similar SVM-based methods: SVM recursive feature elimination (SVMRFE) and recursive support vector machine (RSVM). The three methods were compared based on extensive simulation experiments and analyses of two published microarray datasets. In the simulation experiments, we found that the proposed method is more robust in selecting informative genes than SVMRFE and RSVM and is capable of attaining good classification performance when the variations of informative and noninformative genes are different. In the analysis of two microarray datasets, the proposed method yields better performance in identifying fewer genes with good prediction accuracy, compared to SVMRFE and RSVM.
PMCID: PMC3426197  PMID: 22927888
6.  Integrative analysis of single nucleotide polymorphisms and gene expression efficiently distinguishes samples from closely related ethnic populations 
BMC Genomics  2012;13:346.
Ancestry informative markers (AIMs) are a type of genetic marker that is informative for tracing the ancestral ethnicity of individuals. Application of AIMs has gained substantial attention in population genetics, forensic sciences, and medical genetics. Single nucleotide polymorphisms (SNPs), the materials of AIMs, are useful for classifying individuals from distinct continental origins but cannot discriminate individuals with subtle genetic differences from closely related ancestral lineages. Proof-of-principle studies have shown that gene expression (GE) also is a heritable human variation that exhibits differential intensity distributions among ethnic groups. GE supplies ethnic information supplemental to SNPs; this motivated us to integrate SNP and GE markers to construct AIM panels with a reduced number of required markers and provide high accuracy in ancestry inference. Few studies in the literature have considered GE in this aspect, and none have integrated SNP and GE markers to aid classification of samples from closely related ethnic populations.
We integrated a forward variable selection procedure into flexible discriminant analysis to identify key SNP and/or GE markers with the highest cross-validation prediction accuracy. By analyzing genome-wide SNP and/or GE markers in 210 independent samples from four ethnic groups in the HapMap II Project, we found that average testing accuracies for a majority of classification analyses were quite high, except for SNP-only analyses that were performed to discern study samples containing individuals from two close Asian populations. The average testing accuracies ranged from 0.53 to 0.79 for SNP-only analyses and increased to around 0.90 when GE markers were integrated together with SNP markers for the classification of samples from closely related Asian populations. Compared to GE-only analyses, integrative analyses of SNP and GE markers showed comparable testing accuracies and a reduced number of selected markers in AIM panels.
Integrative analysis of SNP and GE markers provides high-accuracy and/or cost-effective classification results for assigning samples from closely related or distantly related ancestral lineages to their original ancestral populations. User-friendly BIASLESS (Biomarkers Identification and Samples Subdivision) software was developed as an efficient tool for selecting key SNP and/or GE markers and then building models for sample subdivision. BIASLESS was programmed in R and R-GUI and is available online at
PMCID: PMC3453505  PMID: 22839760
Single nucleotide polymorphism (SNP); Allele frequency; Gene expression; HapMap; Classification analysis; Ancestry informative marker (AIM)
7.  Mixed Sequence Reader: A Program for Analyzing DNA Sequences with Heterozygous Base Calling 
The Scientific World Journal  2012;2012:365104.
The direct sequencing of PCR products generates heterozygous base-calling fluorescence chromatograms that are useful for identifying single-nucleotide polymorphisms (SNPs), insertion-deletions (indels), short tandem repeats (STRs), and paralogous genes. Indels and STRs can be easily detected using the currently available Indelligent or ShiftDetector programs, which do not search reference sequences. However, the detection of other genomic variants remains a challenge due to the lack of appropriate tools for heterozygous base-calling fluorescence chromatogram data analysis. In this study, we developed a free web-based program, Mixed Sequence Reader (MSR), which can directly analyze heterozygous base-calling fluorescence chromatogram data in .abi file format using comparisons with reference sequences. The heterozygous sequences are identified as two distinct sequences and aligned with reference sequences. Our results showed that MSR may be used to (i) physically locate indel and STR sequences and determine STR copy number by searching NCBI reference sequences; (ii) predict combinations of microsatellite patterns using the Federal Bureau of Investigation Combined DNA Index System (CODIS); (iii) determine human papilloma virus (HPV) genotypes by searching current viral databases in cases of double infections; (iv) estimate the copy number of paralogous genes, such as β-defensin 4 (DEFB4) and its paralog HSPDP3.
PMCID: PMC3385616  PMID: 22778697
8.  Preservation of Ranking Order in the Expression of Human Housekeeping Genes 
PLoS ONE  2011;6(12):e29314.
Housekeeping (HK) genes fulfill the basic needs for a cell to survive and function properly. Their ubiquitous expression, originally thought to be constant, can vary from tissue to tissue, but this variation remains largely uncharacterized and it could not be explained by previously identified properties of HK genes such as short gene length and high GC content. By analyzing microarray expression data for human genes, we uncovered a previously unnoted characteristic of HK gene expression, namely that the ranking order of their expression levels tends to be preserved from one tissue to another. Further analysis by tensor product decomposition and pathway stratification identified three main factors of the observed ranking preservation, namely that, compared to those of non-HK (NHK) genes, the expression levels of HK genes show a greater degree of dispersion (less overlap), stableness (a smaller variation in expression between tissues), and correlation of expression. Our results shed light on regulatory mechanisms of HK gene expression that are probably different for different HK genes or pathways, but are consistent and coordinated in different tissues.
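One simple way to quantify how well the ranking order of expression levels is preserved between two tissues is a rank correlation. The sketch below uses Spearman's coefficient as an assumption; the abstract does not state which measure the authors used, and the function names and the expression values are hypothetical.

```python
def average_ranks(values):
    """Return 1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = sum((a - mean_x) ** 2 for a in rx) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in ry) ** 0.5
    return cov / (sd_x * sd_y)


if __name__ == "__main__":
    # Hypothetical expression levels of four HK genes in two tissues.
    liver = [100.0, 50.0, 10.0, 5.0]
    brain = [80.0, 60.0, 20.0, 1.0]
    print(spearman(liver, brain))
```

A coefficient near 1 indicates that genes keep the same relative order in both tissues even when their absolute expression levels differ.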
PMCID: PMC3245260  PMID: 22216246
9.  Analysis of human meiotic recombination events with a parent-sibling tracing approach 
BMC Genomics  2011;12:434.
Meiotic recombination ensures that each child inherits distinct genetic materials from each parent, but the distribution of crossovers along meiotic chromosomes remains difficult to identify. In this study, we developed a parent-sibling tracing (PST) approach from previously reported methods to identify meiotic crossover sites in the GEO GSE6754 data set. This approach requires only the single nucleotide polymorphism (SNP) data of pedigrees comprising both parents and at least two children.
Compared to other SNP-based algorithms (identity by descent or pediSNP), fewer uninformative SNPs were derived with the use of PST. Analysis of a GEO GSE6754 data set containing 2,145 maternal and paternal meiotic events revealed that the pattern and distribution of paternal and maternal recombination sites vary along the chromosomes. Lower crossover rates near the centromeres were more prominent in males than in females. Based on analysis of repetitive sequences, we also showed that recombination hotspots are positively correlated with SINE/MIR repetitive elements and negatively correlated with LINE/L1 elements. The number of meiotic recombination events was positively correlated with the number of shorter tandem repeat sequences.
The advantages of the PST approach include the ability to use only two-generation pedigrees with two siblings and the ability to perform gender-specific analyses of repetitive elements and tandem repeat sequences while including fewer uninformative SNP regions in the results.
PMCID: PMC3186786  PMID: 21867557
10.  MicroRNA Expression Aberration as Potential Peripheral Blood Biomarkers for Schizophrenia 
PLoS ONE  2011;6(6):e21635.
Since brain tissue is not readily accessible, a new focus in search of biomarkers for schizophrenia is blood-based expression profiling of non-protein coding genes such as microRNAs (miRNAs), which regulate gene expression by inhibiting the translation of messenger RNAs. This study aimed to identify potential miRNA signature for schizophrenia by comparing genome-wide miRNA expression profiles in patients with schizophrenia vs. healthy controls. A genome-wide miRNA expression profiling was performed using a Taqman array of 365 human miRNAs in the mononuclear leukocytes of a learning set of 30 cases and 30 controls. The discriminating performance of potential biomarkers was validated in an independent testing set of 60 cases and 30 controls. The expression levels of the miRNA signature were then evaluated for their correlation with the patients' clinical symptoms, neurocognitive performances, and neurophysiological functions. A seven-miRNA signature (hsa-miR-34a, miR-449a, miR-564, miR-432, miR-548d, miR-572 and miR-652) was derived from a supervised classification with internal cross-validation, with an area under the curve (AUC) of receiver operating characteristics of 93%. The putative signature was then validated in the testing set, with an AUC of 85%. Among these miRNAs, miR-34a was differentially expressed between cases and controls in both the learning (P = 0.005) and the testing set (P = 0.002). These miRNAs were differentially correlated with patients' negative symptoms, neurocognitive performance scores, and event-related potentials. The results indicated that the mononuclear leukocyte-based miRNA profiling is a feasible way to identify biomarkers for schizophrenia, and the seven-miRNA signature warrants further investigation.
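The area under the ROC curve reported in this abstract has a direct probabilistic reading: it is the chance that a randomly chosen case receives a higher classifier score than a randomly chosen control. The following minimal sketch computes it that way; the function name `roc_auc` and the scores are hypothetical and are not taken from the study.

```python
def roc_auc(case_scores, control_scores):
    """AUC as the fraction of case/control pairs ranked correctly.

    Ties between a case and a control count as half a correct pair.
    """
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


if __name__ == "__main__":
    # Hypothetical signature scores for three cases and three controls.
    cases = [0.9, 0.8, 0.4]
    controls = [0.5, 0.3, 0.2]
    print(roc_auc(cases, controls))
```

An AUC of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation of cases from controls.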
PMCID: PMC3126851  PMID: 21738743
11.  SAQC: SNP Array Quality Control 
BMC Bioinformatics  2011;12:100.
Genome-wide single-nucleotide polymorphism (SNP) arrays containing hundreds of thousands of SNPs from the human genome have proven useful for studying important human genome questions. Data quality of SNP arrays plays a key role in the accuracy and precision of downstream data analyses. However, good indices for assessing data quality of SNP arrays have not yet been developed.
We developed new quality indices to measure the quality of SNP arrays and/or DNA samples and investigated their statistical properties. The indices quantify a departure of estimated individual-level allele frequencies (AFs) from expected frequencies via standardized distances. The proposed quality indices followed lognormal distributions in several large genomic studies that we empirically evaluated. AF reference data and quality index reference data for different SNP array platforms were established based on samples from various reference populations. Furthermore, a confidence interval method based on the underlying empirical distributions of quality indices was developed to identify poor-quality SNP arrays and/or DNA samples. Analyses of authentic biological data and simulated data show that this new method is sensitive and specific for the detection of poor-quality SNP arrays and/or DNA samples.
This study introduces new quality indices, establishes references for AFs and quality indices, and develops a detection method for poor-quality SNP arrays and/or DNA samples. We have developed a new computer program that utilizes these methods called SNP Array Quality Control (SAQC). SAQC software is written in R and R-GUI and was developed as a user-friendly tool for the visualization and evaluation of data quality of genome-wide SNP arrays. The program is available online (
PMCID: PMC3101186  PMID: 21501472
12.  Evaluation of Pulsed-Field Gel Electrophoresis Profiles for Identification of Salmonella Serotypes
Journal of Clinical Microbiology  2010;48(9):3122-3126.
Pulsed-field gel electrophoresis (PFGE) is a standard typing method for isolates from Salmonella outbreaks and epidemiological investigations. Eight hundred sixty-six Salmonella enterica isolates from eight serotypes, including Heidelberg (n = 323), Javiana (n = 200), Typhimurium (n = 163), Newport (n = 93), Enteritidis (n = 45), Dublin (n = 25), Pullorum (n = 9), and Choleraesuis (n = 8), were subjected to PFGE, and their profiles were analyzed by random forest classification and compared to conventional hierarchical cluster analysis to determine potential predictive relationships between PFGE banding patterns and particular serotypes. Cluster analysis displayed only the underlying similarities and relationships of the isolates from the eight serotypes. However, for serotype prediction of a nonserotyped Salmonella isolate from its PFGE pattern, random forest classification provided better accuracy than conventional cluster analysis. Discriminatory DNA band class markers were identified for distinguishing Salmonella serotype Heidelberg, Javiana, Typhimurium, and Newport isolates.
PMCID: PMC2937721  PMID: 20631109
13.  dbDNV: a resource of duplicated gene nucleotide variants in human genome 
Nucleic Acids Research  2010;39(Database issue):D920-D925.
Gene duplications are scattered widely throughout the human genome. A single-base difference located in nearly identical duplicated segments may be misjudged as a single nucleotide polymorphism (SNP) from individuals. This imperfection is indistinguishable in current genotyping methods. As next-generation sequencing technologies become more popular for sequence-based association studies, numerous ambiguous SNPs are rapidly accumulating. Thus, analyzing duplication variations in the reference genome to help prevent false positive SNPs is imperative. We have identified >10% of human genes associated with duplicated gene loci (DGL). Through meticulous sequence alignments of DGL, we systematically designated 1 236 956 variations as duplicated gene nucleotide variants (DNVs). The DNV database (dbDNV) ( has been established to promote more accurate variation annotation. Aside from the flat file download, users can explore the gene-related duplications and the associated DNVs by DGL and DNV searches, respectively. In addition, the dbDNV contains 304 110 DNV-coupled SNPs. From the DNV-coupled SNP search, users can observe which SNP records are also variants among duplicates. This is useful because ∼58% of exonic SNPs in DGL are DNV-coupled. Because of the high accumulation of ambiguous SNPs, we suggest that annotating SNPs with DNV possibilities should improve association studies of these variants with human diseases.
PMCID: PMC3013738  PMID: 21097891
14.  Gene-oriented ortholog database: a functional comparison platform for orthologous loci 
The accumulation of complete genomic sequences increases the need for functional annotation. Associating existing functional annotation of orthologs can speed up the annotation process and even help to check the existing annotation. However, current protein sequence-based ortholog databases provide ambiguous and incomplete orthology in eukaryotes. This is because isoforms derived by alternative splicing (AS) often share higher sequence similarity, which interferes with sequence-based identification. Gene-Oriented Ortholog Database (GOOD) employs genomic locations of transcripts to cluster AS-derived isoforms prior to ortholog delineation to eliminate the interference from AS. From the gene-oriented presentation, isoforms can be clearly associated with their genes to provide comprehensive ortholog information and can further be discriminated from paralogs. In addition, displaying clusters of isoforms between orthologous genes can present evolutionary variation at the transcription level. Based on orthology, GOOD additionally comprises functional annotation from the Gene Ontology (GO) database. However, the GO database contains redundant annotations, in which both parent and child terms are assigned to the same gene. It is difficult to make a precise numerical comparison of term counts between orthologous genes annotated with redundant terms. In addition to the description, GOOD provides GO graphs to reveal hierarchical-like relationships among divergent functionalities. Therefore, the redundancy of GO terms can be examined, and the context among compared terms is more comprehensive. In sum, GOOD can improve the interpretation of molecular function from experiments in model organisms and provide clear comparative genomic annotation across organisms.
Database URL:
PMCID: PMC2860896  PMID: 20428317
15.  Microarray labeling extension values: laboratory signatures for Affymetrix GeneChips 
Nucleic Acids Research  2009;37(8):e61.
Interlaboratory comparison of microarray data, even when using the same platform, poses several challenges to scientists. RNA quality, RNA labeling efficiency, hybridization procedures and data-mining tools can all contribute variation in each laboratory. In Affymetrix GeneChips, about 11–20 different 25-mer oligonucleotides are used to measure the level of each transcript. Here, we report that ‘labeling extension values (LEVs)’, which are correlation coefficients between probe intensities and probe positions, are highly correlated with the gene expression levels (GEVs) on eukaryotic Affymetrix microarray data. By analyzing LEVs and GEVs in the publicly available 2414 cel files of 20 Affymetrix microarray types covering 13 species, we found that correlations between LEVs and GEVs only exist in eukaryotic RNAs, but not in prokaryotic ones. Surprisingly, Affymetrix results of the same specimens that were analyzed in different laboratories could be clearly differentiated only by LEVs, leading to the identification of ‘laboratory signatures’. In the examined dataset, GSE10797, filtering out high-LEV genes did not compromise the discovery of biological processes that are constructed by differentially expressed genes. In conclusion, LEVs provide a new filtering parameter for microarray analysis of gene expression and they may improve the inter- and intralaboratory comparability of Affymetrix GeneChips data.
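Because the abstract defines an LEV as a correlation coefficient between probe intensities and probe positions, the calculation for a single probe set can be sketched as follows. This assumes a Pearson coefficient and positions expressed as simple numeric offsets along the transcript; neither detail is given in the abstract, and the function names and values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def labeling_extension_value(probe_positions, probe_intensities):
    """LEV for one probe set (assumption: Pearson correlation)."""
    return pearson(probe_positions, probe_intensities)


if __name__ == "__main__":
    # Hypothetical probe set: positions along the transcript and intensities.
    positions = [100, 250, 400, 550, 700]
    intensities = [820.0, 760.0, 610.0, 540.0, 300.0]
    print(labeling_extension_value(positions, intensities))
```

Under these assumptions, an LEV far from zero means that signal changes systematically along the transcript, which is the kind of position-dependent effect the abstract attributes to labeling.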
PMCID: PMC2677891  PMID: 19295132
16.  Genomics and proteomics of immune modulatory effects of a butanol fraction of Echinacea purpurea in human dendritic cells 
BMC Genomics  2008;9:479.
Echinacea spp. extracts and the derived phytocompounds have been shown to induce specific immune cell activities and are popularly used as food supplements or nutraceuticals for immuno-modulatory functions. Dendritic cells (DCs), the most potent antigen presenting cells, play an important role in both innate and adaptive immunities. In this study, we investigated the specific and differential gene expression in human immature DCs (iDCs) in response to treatment with a butanol fraction containing defined bioactive phytocompounds extracted from stems and leaves of Echinacea purpurea, that we denoted [BF/S+L/Ep].
Affymetrix DNA microarray results showed significant up regulation of specific genes for cytokines (IL-8, IL-1β, and IL-18) and chemokines (CXCL 2, CCL 5, and CCL 2) within 4 h after [BF/S+L/Ep] treatment of iDCs. Bioinformatics analysis of genes expressed in [BF/S+L/Ep]-treated DCs revealed a key-signaling network involving a number of immune-modulatory molecules leading to the activation of a downstream molecule, adenylate cyclase 8. Proteomic analysis showed increased expression of antioxidant and cytoskeletal proteins after treatment with [BF/S+L/Ep] and cichoric acid.
This study provides information on candidate target molecules and molecular signaling mechanisms for future systematic research into the immune-modulatory activities of an important traditional medicinal herb and its derived phytocompounds.
PMCID: PMC2571112  PMID: 18847511
17.  A method for analyzing censored survival phenotype with gene expression data 
BMC Bioinformatics  2008;9:417.
Survival time is an important clinical trait for many disease studies. Previous works have shown a certain relationship between patients' gene expression profiles and survival time. However, due to the censoring effects of survival time and the high dimensionality of gene expression data, effective and unbiased selection of a gene expression signature to predict survival probabilities requires further study.
We propose a method for an integrated study of survival time and gene expression. This method can be summarized as a two-step procedure: in the first step, a moderate number of genes are pre-selected using correlation or liquid association (LA). Imputation and transformation methods are employed for the correlation/LA calculation. In the second step, the dimension of the predictors is further reduced using the modified sliced inverse regression for censored data (censorSIR).
The new method is tested via both simulated and real data. For the real data application, we employed a set of 295 breast cancer patients and found a linear combination of 22 gene expression profiles that are significantly correlated with patients' survival rate.
By an appropriate combination of feature selection and dimension reduction, we find a method of identifying gene expression signatures which is effective for survival prediction.
PMCID: PMC2579309  PMID: 18837994
18.  Designating eukaryotic orthology via processed transcription units 
Nucleic Acids Research  2008;36(10):3436-3442.
Orthology is a widely used concept in comparative and evolutionary genomics. In addition to prokaryotic orthology, delineating eukaryotic orthology has provided insight into the evolution of higher organisms. Indeed, many eukaryotic ortholog databases have been established for this purpose. However, unlike in prokaryotes, alternative splicing (AS) has hampered eukaryotic orthology assignments. Therefore, existing databases likely contain ambiguous eukaryotic ortholog relationships and possibly misclassify alternatively spliced protein isoforms as in-paralogs, which are duplicated genes that arise following speciation. Here, we propose a new approach for designating eukaryotic orthology using processed transcription units, and we present an orthology database prototype using the human and mouse genomes. Currently existing programs cover less than 69% of the human reference sequences when assigning human/mouse orthologs. In contrast, our method encompasses up to 80% of the human reference sequences. Moreover, the ortholog database presented herein is more than 92% consistent with the existing databases. In addition to managing AS, this approach is capable of identifying orthologs of embedded genes and fusion genes using syntenic evidence. In summary, this new approach is sensitive and specific, and it can generate a more comprehensive and accurate compilation of eukaryotic orthologs.
PMCID: PMC2425467  PMID: 18445630
19.  Methods for simultaneously identifying coherent local clusters with smooth global patterns in gene expression profiles 
BMC Bioinformatics  2008;9:155.
The hierarchical clustering tree (HCT) with a dendrogram [1] and the singular value decomposition (SVD) with a dimension-reduced representative map [2] are popular methods for two-way sorting the gene-by-array matrix map employed in gene expression profiling. While HCT dendrograms tend to optimize local coherent clustering patterns, SVD leading eigenvectors usually identify better global grouping and transitional structures.
This study proposes a flipping mechanism for a conventional agglomerative HCT that uses the rank-two ellipse seriation of Chen [3] (R2E, an improved SVD algorithm for sorting purposes) as an external reference. While HCTs always produce permutations with good local behaviour, the rank-two ellipse seriation gives the best global grouping patterns and smooth transitional trends. The resulting algorithm automatically integrates the desirable properties of each method so that users have access to a clustering and visualization environment for gene expression profiles that preserves coherent local clusters and identifies global grouping trends.
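The idea of flipping a tree against an external reference order can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the nested-tuple tree, the function names, and the rule "place the child with the smaller mean reference rank first" are all assumptions made for illustration. The point it shows is that flipping only swaps the two children of an internal node, so every cluster stays intact while the leaf order moves as close to the reference as the tree allows.

```python
def leaves(node):
    """Return the leaves of a binary tree in left-to-right order.

    A leaf is any non-tuple value; an internal node is a (left, right) tuple.
    """
    if isinstance(node, tuple):
        left, right = node
        return leaves(left) + leaves(right)
    return [node]


def flip_to_reference(node, ref_rank):
    """Flip children bottom-up so the leaf order follows an external ranking.

    ref_rank maps each leaf to its position in the reference seriation.
    Hypothetical rule: at each internal node, the child whose leaves have
    the smaller mean reference rank is placed on the left.
    """
    if not isinstance(node, tuple):
        return node
    left = flip_to_reference(node[0], ref_rank)
    right = flip_to_reference(node[1], ref_rank)

    def mean_rank(sub):
        sub_leaves = leaves(sub)
        return sum(ref_rank[x] for x in sub_leaves) / len(sub_leaves)

    if mean_rank(left) > mean_rank(right):
        left, right = right, left
    return (left, right)


# Toy example: two clusters {d, b} and {c, a}; the reference order is a, b, c, d.
tree = (("d", "b"), ("c", "a"))
ref_rank = {"a": 0, "b": 1, "c": 2, "d": 3}
flipped = flip_to_reference(tree, ref_rank)
print(leaves(flipped))
```

In this toy case the clusters {a, c} and {b, d} are preserved, but the leaves within and between them are reordered to follow the reference ranking.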
We demonstrate, through four examples, that the proposed method not only possesses better numerical and statistical properties but also provides more meaningful biomedical insights than other sorting algorithms. We suggest that sorted proximity matrices for genes and arrays, in addition to the gene-by-array expression matrix, can greatly aid in the search for comprehensive understanding of gene expression structures. Software for the proposed methods can be obtained at .
PMCID: PMC2322988  PMID: 18366693
20.  Gene selection with multiple ordering criteria 
BMC Bioinformatics  2007;8:74.
A microarray study may select different differentially expressed gene sets because of different selection criteria. For example, the fold-change and p-value are two commonly known criteria to select differentially expressed genes under two experimental conditions. These two selection criteria often result in incompatible selected gene sets. Also, in a two-factor, say, treatment by time experiment, the investigator may be interested in one gene list that responds to both treatment and time effects.
We propose three layer-ranking algorithms, namely point-admissible, line-admissible (convex), and Pareto, that provide a preference gene list from multiple gene lists generated by different ranking criteria. Using the public colon data as an example, the layer ranking algorithms are applied to three univariate ranking criteria: fold-change, p-value, and frequency of selection by the SVM-RFE classifier. A simulation experiment shows that, for experiments with small or moderate sample sizes (fewer than 20 per group) that aim to detect a 4-fold change or less, the two-dimensional (p-value and fold-change) convex layer ranking selects differentially expressed genes with generally lower FDR and higher power than the standard p-value ranking. Three applications are presented. The first illustrates the use of the layer rankings to potentially improve predictive accuracy. The second applies the layer rankings to a two-factor experiment involving two dose levels and two time points, selecting differentially expressed genes related to the dose and time effects. In the third, the layer rankings are applied to a benchmark data set consisting of three dilution concentrations to provide a ranking system from a long list of differentially expressed genes generated from the three dilution concentrations.
The layer ranking algorithms are useful to help investigators in selecting the most promising genes from multiple gene lists generated by different filter, normalization, or analysis methods for various objectives.
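As a rough illustration of Pareto layer ranking, the sketch below uses standard non-dominated sorting on two criteria (p-value and fold-change). It may differ in detail from the definition used in the paper. The function names, the gene labels, and the toy values are assumptions and are not taken from the study. Layer 1 contains genes that no other gene beats on both criteria; removing them and repeating gives layer 2, and so on.

```python
def dominates(a, b):
    """True if cost vector a is no worse than b everywhere and better somewhere.

    Smaller values are better on every coordinate.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_layers(costs):
    """Assign each cost vector a Pareto layer (1 = first, non-dominated layer)."""
    remaining = set(range(len(costs)))
    layer_of = [0] * len(costs)
    layer = 0
    while remaining:
        layer += 1
        # The current front: points not dominated by any other remaining point.
        front = [
            i for i in remaining
            if not any(dominates(costs[j], costs[i]) for j in remaining if j != i)
        ]
        for i in front:
            layer_of[i] = layer
        remaining -= set(front)
    return layer_of


# Toy values (illustrative only): (p-value, fold-change) per gene.
genes = {
    "g1": (0.001, 3.0),
    "g2": (0.01, 4.0),
    "g3": (0.02, 2.0),
    "g4": (0.5, 1.1),
}
# Turn both criteria into costs: small p-value is good, large fold-change is good.
costs = [(p, -fc) for p, fc in genes.values()]
layers = dict(zip(genes, pareto_layers(costs)))
print(layers)
```

Here g1 has the smallest p-value and g2 the largest fold-change, so neither dominates the other and both fall in the first layer, which is the kind of compromise a single-criterion ranking cannot express.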
PMCID: PMC1829166  PMID: 17338815
21.  Design of microarray probes for virus identification and detection of emerging viruses at the genus level 
BMC Bioinformatics  2006;7:232.
Most virus detection methods are geared towards the detection of specific single viruses or just a few known targets, and lack the capability to uncover the novel viruses that cause emerging viral infections. To address this issue, we developed a computational method that identifies the conserved viral sequences at the genus level for all viral genomes available in GenBank, and established a virus probe library. The virus probes are used not only to identify known viruses but also for discerning the genera of emerging or uncharacterized ones.
Using the microarray approach, the identity of the virus in a test sample is determined by the signals of both genus and species-specific probes. The genera of emerging and uncharacterized viruses are determined based on hybridization of the viral sequences to the conserved probes for the existing viral genera. A detection and classification procedure to determine the identity of a virus directly from detection signals results in the rapid identification of the virus.
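The two-level logic described above (genus probes first, species probes second) can be sketched in Python. This is a hypothetical decision rule and not the published procedure: the intensity threshold, the minimum fraction of positive probes, the probe identifiers, and the signal values are all assumptions chosen for illustration. A genus with positive conserved probes but no positive species probe is reported with `None`, which corresponds to an emerging or uncharacterized member of that genus.

```python
from collections import defaultdict


def positive_groups(signals, probe_group, threshold, min_fraction):
    """Return the groups whose probes are mostly above the intensity threshold."""
    flags = defaultdict(list)
    for probe_id, group in probe_group.items():
        flags[group].append(signals.get(probe_id, 0.0) >= threshold)
    return {g for g, f in flags.items() if sum(f) / len(f) >= min_fraction}


def call_virus(signals, genus_probes, species_probes, species_genus,
               threshold=1000.0, min_fraction=0.5):
    """Report (genus, species list or None) for every genus that is detected.

    threshold and min_fraction are assumed cut-offs, not values from the paper.
    """
    genera = positive_groups(signals, genus_probes, threshold, min_fraction)
    species = positive_groups(signals, species_probes, threshold, min_fraction)
    calls = []
    for genus in sorted(genera):
        known = sorted(s for s in species if species_genus[s] == genus)
        calls.append((genus, known if known else None))
    return calls


# Toy probe library and signals (illustrative only).
genus_probes = {"gp1": "Coronavirus", "gp2": "Coronavirus", "gp3": "Flavivirus"}
species_probes = {"sp1": "SARS-CoV", "sp2": "Dengue virus"}
species_genus = {"SARS-CoV": "Coronavirus", "Dengue virus": "Flavivirus"}

novel = {"gp1": 5000.0, "gp2": 3000.0, "gp3": 100.0, "sp1": 200.0, "sp2": 50.0}
known = dict(novel, sp1=4000.0)

print(call_virus(novel, genus_probes, species_probes, species_genus))
print(call_virus(known, genus_probes, species_probes, species_genus))
```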
We have demonstrated the validity and feasibility of the above strategy with a small number of viral samples. The probe design algorithm can be applied to any publicly available viral sequence database. The strategy of using separate genus and species probe sets enables the use of a straightforward virus identity calculation directly based on the hybridization signals. Our virus identification strategy has great potential in the diagnosis of viral infections. The virus genus and specific probe database and the associated summary tables are available at
PMCID: PMC1523220  PMID: 16643672
22.  Molecular signature of clinical severity in recovering patients with severe acute respiratory syndrome coronavirus (SARS-CoV) 
BMC Genomics  2005;6:132.
Severe acute respiratory syndrome (SARS), a recent epidemic human disease, is caused by a novel coronavirus (SARS-CoV). First reported in Asia, SARS quickly spread worldwide through international travel. As of July 2003, the World Health Organization reported a total of 8,437 people afflicted with SARS with a 9.6% mortality rate. Although immunopathological damage may account for the severity of respiratory distress, little is known about how the genome-wide gene expression of the host changes under the attack of SARS-CoV.
Based on changes in gene expression of peripheral blood, we identified 52 signature genes that accurately discriminated acute SARS patients from non-SARS controls. While a general suppression of gene expression predominated in SARS-infected blood, several genes including those involved in innate immunity, such as defensins and eosinophil-derived neurotoxin, were upregulated. Instead of employing clustering methods, we ranked the severity of recovering SARS patients by generalized association plots (GAP) according to the expression profiles of the 52 signature genes. Through this method, we discovered a smooth transition pattern of severity from normal controls to acute SARS patients. The rank of SARS severity was significantly correlated with the recovery period (in days) and with the clinical pulmonary infection score.
The use of the GAP approach has proved useful in analyzing the complexity and continuity of biological systems. The severity rank derived from the global expression profile of significantly regulated genes in patients may be useful for further elucidating the pathophysiology of their disease.
PMCID: PMC1262710  PMID: 16174304
23.  Optimization of probe length and the number of probes per gene for optimal microarray analysis of gene expression 
Nucleic Acids Research  2004;32(12):e99.
Gene-specific oligonucleotide probes are currently used in microarrays to avoid cross-hybridization of highly similar sequences. We developed an approach to determine the optimal number and length of gene-specific probes for accurate transcriptional profiling studies. The study surveyed probe lengths from 25 to 1000 nt. Long probes yield better signal intensity than short probes. The signal intensity of short probes can be improved by addition of spacers or using higher probe concentration for spotting. We also found that accurate gene expression measurement can be achieved with multiple probes per gene and fewer probes are needed if longer probes rather than shorter probes are used. Based on theoretical considerations that were confirmed experimentally, our results showed that 150mer is the optimal probe length for expression measurement. Gene-specific probes can be identified using a computational approach for 150mer probes and they can be treated like long cDNA probes in terms of the hybridization reaction for high sensitivity detection. Our experimental data also show that probes which do not generate good signal intensity give erroneous expression ratio measurement results. To use microarray probes without experimental validation, gene-specific probes ∼150mer in length are necessary. However, shorter oligonucleotide probes also work well in gene expression analysis if the probes are validated by experimental selection or if multiple probes per gene are used for expression measurement.
PMCID: PMC484198  PMID: 15243142
