The development of DNA microarray makes researchers screen thousands of genes simultaneously and it also helps determine high- and low-expression level genes in normal and disease tissues. Selecting relevant genes for cancer classification is an important issue. Most of the gene selection methods use univariate ranking criteria and arbitrarily choose a threshold to choose genes. However, the parameter setting may not be compatible to the selected classification algorithms. In this paper, we propose a new gene selection method (SVM-t) based on the use of t-statistics embedded in support vector machine. We compared the performance to two similar SVM-based methods: SVM recursive feature elimination (SVMRFE) and recursive support vector machine (RSVM). The three methods were compared based on extensive simulation experiments and analyses of two published microarray datasets. In the simulation experiments, we found that the proposed method is more robust in selecting informative genes than SVMRFE and RSVM and capable to attain good classification performance when the variations of informative and noninformative genes are different. In the analysis of two microarray datasets, the proposed method yields better performance in identifying fewer genes with good prediction accuracy, compared to SVMRFE and RSVM.
Ancestry informative markers (AIMs) are a type of genetic marker that is informative for tracing the ancestral ethnicity of individuals. Application of AIMs has gained substantial attention in population genetics, forensic sciences, and medical genetics. Single nucleotide polymorphisms (SNPs), the materials of AIMs, are useful for classifying individuals from distinct continental origins but cannot discriminate individuals with subtle genetic differences from closely related ancestral lineages. Proof-of-principle studies have shown that gene expression (GE) also is a heritable human variation that exhibits differential intensity distributions among ethnic groups. GE supplies ethnic information supplemental to SNPs; this motivated us to integrate SNP and GE markers to construct AIM panels with a reduced number of required markers and provide high accuracy in ancestry inference. Few studies in the literature have considered GE in this aspect, and none have integrated SNP and GE markers to aid classification of samples from closely related ethnic populations.
We integrated a forward variable selection procedure into flexible discriminant analysis to identify key SNP and/or GE markers with the highest cross-validation prediction accuracy. By analyzing genome-wide SNP and/or GE markers in 210 independent samples from four ethnic groups in the HapMap II Project, we found that average testing accuracies for a majority of classification analyses were quite high, except for SNP-only analyses that were performed to discern study samples containing individuals from two close Asian populations. The average testing accuracies ranged from 0.53 to 0.79 for SNP-only analyses and increased to around 0.90 when GE markers were integrated together with SNP markers for the classification of samples from closely related Asian populations. Compared to GE-only analyses, integrative analyses of SNP and GE markers showed comparable testing accuracies and a reduced number of selected markers in AIM panels.
Integrative analysis of SNP and GE markers provides high-accuracy and/or cost-effective classification results for assigning samples from closely related or distantly related ancestral lineages to their original ancestral populations. User-friendly BIASLESS (Biomarkers Identification and Samples Subdivision) software was developed as an efficient tool for selecting key SNP and/or GE markers and then building models for sample subdivision. BIASLESS was programmed in R and R-GUI and is available online at http://www.stat.sinica.edu.tw/hsinchou/genetics/prediction/BIASLESS.htm.
Single nucleotide polymorphism (SNP); Allele frequency; Gene expression; HapMap; Classification analysis; Ancestry informative marker (AIM)
The direct sequencing of PCR products generates heterozygous base-calling fluorescence chromatograms that are useful for identifying single-nucleotide polymorphisms (SNPs), insertion-deletions (indels), short tandem repeats (STRs), and paralogous genes. Indels and STRs can be easily detected using the currently available Indelligent or ShiftDetector programs, which do not search reference sequences. However, the detection of other genomic variants remains a challenge due to the lack of appropriate tools for heterozygous base-calling fluorescence chromatogram data analysis. In this study, we developed a free web-based program, Mixed Sequence Reader (MSR), which can directly analyze heterozygous base-calling fluorescence chromatogram data in .abi file format using comparisons with reference sequences. The heterozygous sequences are identified as two distinct sequences and aligned with reference sequences. Our results showed that MSR may be used to (i) physically locate indel and STR sequences and determine STR copy number by searching NCBI reference sequences; (ii) predict combinations of microsatellite patterns using the Federal Bureau of Investigation Combined DNA Index System (CODIS); (iii) determine human papilloma virus (HPV) genotypes by searching current viral databases in cases of double infections; (iv) estimate the copy number of paralogous genes, such as β-defensin 4 (DEFB4) and its paralog HSPDP3.
Housekeeping (HK) genes fulfill the basic needs for a cell to survive and function properly. Their ubiquitous expression, originally thought to be constant, can vary from tissue to tissue, but this variation remains largely uncharacterized and it could not be explained by previously identified properties of HK genes such as short gene length and high GC content. By analyzing microarray expression data for human genes, we uncovered a previously unnoted characteristic of HK gene expression, namely that the ranking order of their expression levels tends to be preserved from one tissue to another. Further analysis by tensor product decomposition and pathway stratification identified three main factors of the observed ranking preservation, namely that, compared to those of non-HK (NHK) genes, the expression levels of HK genes show a greater degree of dispersion (less overlap), stableness (a smaller variation in expression between tissues), and correlation of expression. Our results shed light on regulatory mechanisms of HK gene expression that are probably different for different HK genes or pathways, but are consistent and coordinated in different tissues.
Meiotic recombination ensures that each child inherits distinct genetic materials from each parent, but the distribution of crossovers along meiotic chromosomes remains difficult to identify. In this study, we developed a parent-sibling tracing (PST) approach from previously reported methods to identify meiotic crossover sites of GEO GSE6754 data set. This approach requires only the single nucleotide polymorphism (SNP) data of the pedigrees of both parents and at least two of children.
Compared to other SNP-based algorithms (identity by descent or pediSNP), fewer uninformative SNPs were derived with the use of PST. Analysis of a GEO GSE6754 data set containing 2,145 maternal and paternal meiotic events revealed that the pattern and distribution of paternal and maternal recombination sites vary along the chromosomes. Lower crossover rates near the centromeres were more prominent in males than in females. Based on analysis of repetitive sequences, we also showed that recombination hotspots are positively correlated with SINE/MIR repetitive elements and negatively correlated with LINE/L1 elements. The number of meiotic recombination events was positively correlated with the number of shorter tandem repeat sequences.
The advantages of the PST approach include the ability to use only two-generation pedigrees with two siblings and the ability to perform gender-specific analyses of repetitive elements and tandem repeat sequences while including fewer uninformative SNP regions in the results.
Since brain tissue is not readily accessible, a new focus in search of biomarkers for schizophrenia is blood-based expression profiling of non-protein coding genes such as microRNAs (miRNAs), which regulate gene expression by inhibiting the translation of messenger RNAs. This study aimed to identify potential miRNA signature for schizophrenia by comparing genome-wide miRNA expression profiles in patients with schizophrenia vs. healthy controls. A genome-wide miRNA expression profiling was performed using a Taqman array of 365 human miRNAs in the mononuclear leukocytes of a learning set of 30 cases and 30 controls. The discriminating performance of potential biomarkers was validated in an independent testing set of 60 cases and 30 controls. The expression levels of the miRNA signature were then evaluated for their correlation with the patients' clinical symptoms, neurocognitive performances, and neurophysiological functions. A seven-miRNA signature (hsa-miR-34a, miR-449a, miR-564, miR-432, miR-548d, miR-572 and miR-652) was derived from a supervised classification with internal cross-validation, with an area under the curve (AUC) of receiver operating characteristics of 93%. The putative signature was then validated in the testing set, with an AUC of 85%. Among these miRNAs, miR-34a was differentially expressed between cases and controls in both the learning (P = 0.005) and the testing set (P = 0.002). These miRNAs were differentially correlated with patients' negative symptoms, neurocognitive performance scores, and event-related potentials. The results indicated that the mononuclear leukocyte-based miRNA profiling is a feasible way to identify biomarkers for schizophrenia, and the seven-miRNA signature warrants further investigation.
Genome-wide single-nucleotide polymorphism (SNP) arrays containing hundreds of thousands of SNPs from the human genome have proven useful for studying important human genome questions. Data quality of SNP arrays plays a key role in the accuracy and precision of downstream data analyses. However, good indices for assessing data quality of SNP arrays have not yet been developed.
We developed new quality indices to measure the quality of SNP arrays and/or DNA samples and investigated their statistical properties. The indices quantify a departure of estimated individual-level allele frequencies (AFs) from expected frequencies via standardized distances. The proposed quality indices followed lognormal distributions in several large genomic studies that we empirically evaluated. AF reference data and quality index reference data for different SNP array platforms were established based on samples from various reference populations. Furthermore, a confidence interval method based on the underlying empirical distributions of quality indices was developed to identify poor-quality SNP arrays and/or DNA samples. Analyses of authentic biological data and simulated data show that this new method is sensitive and specific for the detection of poor-quality SNP arrays and/or DNA samples.
This study introduces new quality indices, establishes references for AFs and quality indices, and develops a detection method for poor-quality SNP arrays and/or DNA samples. We have developed a new computer program that utilizes these methods called SNP Array Quality Control (SAQC). SAQC software is written in R and R-GUI and was developed as a user-friendly tool for the visualization and evaluation of data quality of genome-wide SNP arrays. The program is available online (http://www.stat.sinica.edu.tw/hsinchou/genetics/quality/SAQC.htm).
Pulsed-field gel electrophoresis (PFGE) is a standard typing method for isolates from Salmonella outbreaks and epidemiological investigations. Eight hundred sixty-six Salmonella enterica isolates from eight serotypes, including Heidelberg (n = 323), Javiana (n = 200), Typhimurium (n = 163), Newport (n = 93), Enteritidis (n = 45), Dublin (n = 25), Pullorum (n = 9), and Choleraesuis (n = 8), were subjected to PFGE, and their profiles were analyzed by random forest classification and compared to conventional hierarchical cluster analysis to determine potential predictive relationships between PFGE banding patterns and particular serotypes. Cluster analysis displayed only the underlying similarities and relationships of the isolates from the eight serotypes. However, for serotype prediction of a nonserotyped Salmonella isolate from its PFGE pattern, random forest classification provided better accuracy than conventional cluster analysis. Discriminatory DNA band class markers were identified for distinguishing Salmonella serotype Heidelberg, Javiana, Typhimurium, and Newport isolates.
Gene duplications are scattered widely throughout the human genome. A single-base difference located in nearly identical duplicated segments may be misjudged as a single nucleotide polymorphism (SNP) from individuals. This imperfection is undistinguishable in current genotyping methods. As the next-generation sequencing technologies become more popular for sequence-based association studies, numerous ambiguous SNPs are rapidly accumulated. Thus, analyzing duplication variations in the reference genome to assist in preventing false positive SNPs is imperative. We have identified >10% of human genes associated with duplicated gene loci (DGL). Through meticulous sequence alignments of DGL, we systematically designated 1 236 956 variations as duplicated gene nucleotide variants (DNVs). The DNV database (dbDNV) (http://goods.ibms.sinica.edu.tw/DNVs/) has been established to promote more accurate variation annotation. Aside from the flat file download, users can explore the gene-related duplications and the associated DNVs by DGL and DNV searches, respectively. In addition, the dbDNV contains 304 110 DNV-coupled SNPs. From DNV-coupled SNP search, users observe which SNP records are also variants among duplicates. This is useful while ∼58% of exonic SNPs in DGL are DNV-coupled. Because of high accumulation of ambiguous SNPs, we suggest that annotating SNPs with DNVs possibilities should improve association studies of these variants with human diseases.
The accumulation of complete genomic sequences enhances the need for functional annotation. Associating existing functional annotation of orthologs can speed up the annotation process and even examine the existing annotation. However, current protein sequence-based ortholog databases provide ambiguous and incomplete orthology in eukaryotes. It is because that isoforms, derived by alternative splicing (AS), often share higher sequence similarity to interfere the sequence-based identification. Gene-Oriented Ortholog Database (GOOD) employs genomic locations of transcripts to cluster AS-derived isoforms prior to ortholog delineation to eliminate the interference from AS. From the gene-oriented presentation, isoforms can be clearly associated to their genes to provide comprehensive ortholog information and further be discriminated from paralogs. Aside from, displaying clusters of isoforms between orthologous genes can present the evolution variation at the transcription level. Based on orthology, GOOD additionally comprises functional annotation from the Gene Ontology (GO) database. However, there exist redundant annotations, both parent and child terms assigned to the same gene, in the GO database. It is difficult to precisely draw the numerical comparison of term counts between orthologous genes annotated with redundant terms. Instead of the description only, GOOD further provides the GO graphs to reveal hierarchical-like relationships among divergent functionalities. Therefore, the redundancy of GO terms can be examined, and the context among compared terms is more comprehensive. In sum, GOOD can improve the interpretation in the molecular function from experiments in the model organism and provide clear comparative genomic annotation across organisms.
Database URL: http://goods.ibms.sinica.edu.tw/goods/
Interlaboratory comparison of microarray data, even when using the same platform, imposes several challenges to scientists. RNA quality, RNA labeling efficiency, hybridization procedures and data-mining tools can all contribute variations in each laboratory. In Affymetrix GeneChips, about 11–20 different 25-mer oligonucleotides are used to measure the level of each transcript. Here, we report that ‘labeling extension values (LEVs)’, which are correlation coefficients between probe intensities and probe positions, are highly correlated with the gene expression levels (GEVs) on eukayotic Affymetrix microarray data. By analyzing LEVs and GEVs in the publicly available 2414 cel files of 20 Affymetrix microarray types covering 13 species, we found that correlations between LEVs and GEVs only exist in eukaryotic RNAs, but not in prokaryotic ones. Surprisingly, Affymetrix results of the same specimens that were analyzed in different laboratories could be clearly differentiated only by LEVs, leading to the identification of ‘laboratory signatures’. In the examined dataset, GSE10797, filtering out high-LEV genes did not compromise the discovery of biological processes that are constructed by differentially expressed genes. In conclusion, LEVs provide a new filtering parameter for microarray analysis of gene expression and it may improve the inter- and intralaboratory comparability of Affymetrix GeneChips data.
Echinacea spp. extracts and the derived phytocompounds have been shown to induce specific immune cell activities and are popularly used as food supplements or nutraceuticals for immuno-modulatory functions. Dendritic cells (DCs), the most potent antigen presenting cells, play an important role in both innate and adaptive immunities. In this study, we investigated the specific and differential gene expression in human immature DCs (iDCs) in response to treatment with a butanol fraction containing defined bioactive phytocompounds extracted from stems and leaves of Echinacea purpurea, that we denoted [BF/S+L/Ep].
Affymetrix DNA microarray results showed significant up regulation of specific genes for cytokines (IL-8, IL-1β, and IL-18) and chemokines (CXCL 2, CCL 5, and CCL 2) within 4 h after [BF/S+L/Ep] treatment of iDCs. Bioinformatics analysis of genes expressed in [BF/S+L/Ep]-treated DCs revealed a key-signaling network involving a number of immune-modulatory molecules leading to the activation of a downstream molecule, adenylate cyclase 8. Proteomic analysis showed increased expression of antioxidant and cytoskeletal proteins after treatment with [BF/S+L/Ep] and cichoric acid.
This study provides information on candidate target molecules and molecular signaling mechanisms for future systematic research into the immune-modulatory activities of an important traditional medicinal herb and its derived phytocompounds.
Survival time is an important clinical trait for many disease studies. Previous works have shown certain relationship between patients' gene expression profiles and survival time. However, due to the censoring effects of survival time and the high dimensionality of gene expression data, effective and unbiased selection of a gene expression signature to predict survival probabilities requires further study.
We propose a method for an integrated study of survival time and gene expression. This method can be summarized as a two-step procedure: in the first step, a moderate number of genes are pre-selected using correlation or liquid association (LA). Imputation and transformation methods are employed for the correlation/LA calculation. In the second step, the dimension of the predictors is further reduced using the modified sliced inverse regression for censored data (censorSIR).
The new method is tested via both simulated and real data. For the real data application, we employed a set of 295 breast cancer patients and found a linear combination of 22 gene expression profiles that are significantly correlated with patients' survival rate.
By an appropriate combination of feature selection and dimension reduction, we find a method of identifying gene expression signatures which is effective for survival prediction.
Orthology is a widely used concept in comparative and evolutionary genomics. In addition to prokaryotic orthology, delineating eukaryotic orthology has provided insight into the evolution of higher organisms. Indeed, many eukaryotic ortholog databases have been established for this purpose. However, unlike prokaryotes, alternative splicing (AS) has hampered eukaryotic orthology assignments. Therefore, existing databases likely contain ambiguous eukaryotic ortholog relationships and possibly misclassify alternatively spliced protein isoforms as in-paralogs, which are duplicated genes that arise following speciation. Here, we propose a new approach for designating eukaryotic orthology using processed transcription units, and we present an orthology database prototype using the human and mouse genomes. Currently existing programs cover less than 69% of the human reference sequences when assigning human/mouse orthologs. In contrast, our method encompasses up to 80% of the human reference sequences. Moreover, the ortholog database presented herein is more than 92% consistent with the existing databases. In addition to managing AS, this approach is capable of identifying orthologs of embedded genes and fusion genes using syntenic evidence. In summary, this new approach is sensitive, specific and can generate a more comprehensive and accurate compilation of eukaryotic orthologs.
The hierarchical clustering tree (HCT) with a dendrogram  and the singular value decomposition (SVD) with a dimension-reduced representative map  are popular methods for two-way sorting the gene-by-array matrix map employed in gene expression profiling. While HCT dendrograms tend to optimize local coherent clustering patterns, SVD leading eigenvectors usually identify better global grouping and transitional structures.
This study proposes a flipping mechanism for a conventional agglomerative HCT using a rank-two ellipse (R2E, an improved SVD algorithm for sorting purpose) seriation by Chen  as an external reference. While HCTs always produce permutations with good local behaviour, the rank-two ellipse seriation gives the best global grouping patterns and smooth transitional trends. The resulting algorithm automatically integrates the desirable properties of each method so that users have access to a clustering and visualization environment for gene expression profiles that preserves coherent local clusters and identifies global grouping trends.
We demonstrate, through four examples, that the proposed method not only possesses better numerical and statistical properties, it also provides more meaningful biomedical insights than other sorting algorithms. We suggest that sorted proximity matrices for genes and arrays, in addition to the gene-by-array expression matrix, can greatly aid in the search for comprehensive understanding of gene expression structures. Software for the proposed methods can be obtained at .
A microarray study may select different differentially expressed gene sets because of different selection criteria. For example, the fold-change and p-value are two commonly known criteria to select differentially expressed genes under two experimental conditions. These two selection criteria often result in incompatible selected gene sets. Also, in a two-factor, say, treatment by time experiment, the investigator may be interested in one gene list that responds to both treatment and time effects.
We propose three layer ranking algorithms, point-admissible, line-admissible (convex), and Pareto, to provide a preference gene list from multiple gene lists generated by different ranking criteria. Using the public colon data as an example, the layer ranking algorithms are applied to the three univariate ranking criteria, fold-change, p-value, and frequency of selections by the SVM-RFE classifier. A simulation experiment shows that for experiments with small or moderate sample sizes (less than 20 per group) and detecting a 4-fold change or less, the two-dimensional (p-value and fold-change) convex layer ranking selects differentially expressed genes with generally lower FDR and higher power than the standard p-value ranking. Three applications are presented. The first application illustrates a use of the layer rankings to potentially improve predictive accuracy. The second application illustrates an application to a two-factor experiment involving two dose levels and two time points. The layer rankings are applied to selecting differentially expressed genes relating to the dose and time effects. In the third application, the layer rankings are applied to a benchmark data set consisting of three dilution concentrations to provide a ranking system from a long list of differentially expressed genes generated from the three dilution concentrations.
The layer ranking algorithms are useful to help investigators in selecting the most promising genes from multiple gene lists generated by different filter, normalization, or analysis methods for various objectives.
Most virus detection methods are geared towards the detection of specific single viruses or just a few known targets, and lack the capability to uncover the novel viruses that cause emerging viral infections. To address this issue, we developed a computational method that identifies the conserved viral sequences at the genus level for all viral genomes available in GenBank, and established a virus probe library. The virus probes are used not only to identify known viruses but also for discerning the genera of emerging or uncharacterized ones.
Using the microarray approach, the identity of the virus in a test sample is determined by the signals of both genus and species-specific probes. The genera of emerging and uncharacterized viruses are determined based on hybridization of the viral sequences to the conserved probes for the existing viral genera. A detection and classification procedure to determine the identity of a virus directly from detection signals results in the rapid identification of the virus.
We have demonstrated the validity and feasibility of the above strategy with a small number of viral samples. The probe design algorithm can be applied to any publicly available viral sequence database. The strategy of using separate genus and species probe sets enables the use of a straightforward virus identity calculation directly based on the hybridization signals. Our virus identification strategy has great potential in the diagnosis of viral infections. The virus genus and specific probe database and the associated summary tables are available at
Severe acute respiratory syndrome (SARS), a recent epidemic human disease, is caused by a novel coronavirus (SARS-CoV). First reported in Asia, SARS quickly spread worldwide through international travelling. As of July 2003, the World Health Organization reported a total of 8,437 people afflicted with SARS with a 9.6% mortality rate. Although immunopathological damages may account for the severity of respiratory distress, little is known about how the genome-wide gene expression of the host changes under the attack of SARS-CoV.
Based on changes in gene expression of peripheral blood, we identified 52 signature genes that accurately discriminated acute SARS patients from non-SARS controls. While a general suppression of gene expression predominated in SARS-infected blood, several genes including those involved in innate immunity, such as defensins and eosinophil-derived neurotoxin, were upregulated. Instead of employing clustering methods, we ranked the severity of recovering SARS patients by generalized associate plots (GAP) according to the expression profiles of 52 signature genes. Through this method, we discovered a smooth transition pattern of severity from normal controls to acute SARS patients. The rank of SARS severity was significantly correlated with the recovery period (in days) and with the clinical pulmonary infection score.
The use of the GAP approach has proved useful in analyzing the complexity and continuity of biological systems. The severity rank derived from the global expression profile of significantly regulated genes in patients may be useful for further elucidating the pathophysiology of their disease.
Gene-specific oligonucleotide probes are currently used in microarrays to avoid cross-hybridization of highly similar sequences. We developed an approach to determine the optimal number and length of gene-specific probes for accurate transcriptional profiling studies. The study surveyed probe lengths from 25 to 1000 nt. Long probes yield better signal intensity than short probes. The signal intensity of short probes can be improved by addition of spacers or using higher probe concentration for spotting. We also found that accurate gene expression measurement can be achieved with multiple probes per gene and fewer probes are needed if longer probes rather than shorter probes are used. Based on theoretical considerations that were confirmed experimentally, our results showed that 150mer is the optimal probe length for expression measurement. Gene-specific probes can be identified using a computational approach for 150mer probes and they can be treated like long cDNA probes in terms of the hybridization reaction for high sensitivity detection. Our experimental data also show that probes which do not generate good signal intensity give erroneous expression ratio measurement results. To use microarray probes without experimental validation, gene-specific probes ∼150mer in length are necessary. However, shorter oligonucleotide probes also work well in gene expression analysis if the probes are validated by experimental selection or if multiple probes per gene are used for expression measurement.