
Results 1-12 (12)

1.  ARYANA: Aligning Reads by Yet Another Approach 
BMC Bioinformatics  2014;15(Suppl 9):S12.
Although there are many different algorithms and software tools for aligning sequencing reads, fast gapped sequence search is far from solved. Strong interest in fast alignment is best reflected in the $10^6 prize for the Innocentive competition on aligning a collection of reads to a given database of reference genomes. In addition, de novo assembly of next-generation sequencing long reads requires fast overlap-layout-consensus algorithms, which depend on fast and accurate alignment.
We introduce ARYANA, a fast gapped read aligner built on the BWA indexing infrastructure, with a completely new alignment engine that makes it significantly faster than three other aligners (Bowtie2, BWA and SeqAlto) with comparable generality and accuracy. Instead of time-consuming backtracking procedures for handling mismatches, ARYANA uses the seed-and-extend algorithmic framework and achieves significantly improved efficiency by integrating novel algorithmic techniques, including dynamic seed selection, bidirectional seed extension, reset-free hash tables, and gap-filling dynamic programming. As the read length increases, ARYANA's superiority in terms of speed and alignment rate becomes more evident. This is in perfect harmony with the read length trend as the sequencing technologies evolve. The algorithmic platform of ARYANA makes it easy to develop mission-specific aligners for other applications using the ARYANA engine.
ARYANA with complete source code can be obtained from
PMCID: PMC4168712  PMID: 25252881
Alignment; Read mapping; DNA sequencing; Next generation sequencing (NGS)
2.  Analysis of Candidate Genes Has Proposed the Role of Y Chromosome in Human Prostate Cancer 
Prostate cancer, a serious genetic disease, is known as the most widespread cancer in men, but the molecular changes required for cancer progression are not fully understood. The availability of high-throughput gene expression data has led to the development of various computational methods for identifying the critical genes involved in the cancer.
In this paper, we show that constructing co-expression networks using Y-chromosome genes provides an alternative strategy for detecting new candidate genes that might be involved in prostate cancer. In our approach, we constructed independent co-expression networks for the normal and cancerous stages using a reverse engineering approach. We then highlighted crucial Y chromosome genes involved in prostate cancer by analyzing the networks on the basis of party and date hubs.
Our results led to the detection of 19 critical genes related to prostate cancer, 12 of which have previously been shown to be involved in this cancer. In addition, essential Y chromosome genes were searched for by reconstructing sub-networks, which led to the identification of 4 experimentally established and 4 new Y chromosome genes that might putatively be linked to prostate cancer.
Correct inference of the master genes that mediate molecular changes during cancer progression is one of the major challenges in cancer genomics. In this paper, we have shown the role of Y chromosome genes in finding prostate cancer susceptibility genes. Applying our approach to prostate cancer confirmed previous knowledge about this cancer and predicted other new genes.
PMCID: PMC4307103  PMID: 25628841
Co-expression networks; expression data; prostate cancer; reverse engineering approach
3.  Natural Biased Coin Encoded in the Genome Determines Cell Strategy 
PLoS ONE  2014;9(8):e103569.
Decision making at a cellular level determines different fates for isogenic cells. However, it is not yet clear how rational decisions are encoded in the genome, how they are transmitted to their offspring, and whether they evolve and become optimized throughout generations. In this paper, we use a game theoretic approach to explain how rational decisions are made in the presence of cooperators and competitors. Our results suggest the existence of an internal switch that operates as a biased coin. The biased coin is, in fact, a biochemical bistable network of interacting genes that can flip to one of its stable states in response to different environmental stimuli. We present a framework to describe how the positions of attractors in such a gene regulatory network correspond to the behavior of a rational player in a competing environment. We evaluate our model by considering lysis/lysogeny decision making of bacteriophage lambda in E. coli.
PMCID: PMC4121144  PMID: 25090629
4.  Dependency of codon usage on protein sequence patterns: a statistical study 
Codon degeneracy and codon usage by organisms is an interesting and challenging problem. Researchers have demonstrated the relation between codon usage and various functions or properties of genes and proteins, such as gene regulation, translation rate, translation efficiency, mRNA stability, splicing, and protein domains. Researchers usually represent segments of proteins responsible for specific functions or structures in a family of proteins as sequence patterns or motifs. We asked whether organisms use the same codons in pattern segments as in the rest of the sequence.
We used the likelihood ratio test, Pearson’s chi-squared test, and mutual information to compare these two codon usages.
We showed that codon usage in segments of genes that code for a given pattern or motif in a group of proteins varied from that of the rest of the gene. The codon usage in these segments was not random. Amino acids with a larger number of codons used more specific codon ratios in these segments. We studied the number of amino acids in the pattern (pattern length). As patterns got longer, there was a slight decrease in the fraction of patterns with significantly different codon usage in the pattern region as compared to codon usage in the gene region. We defined a measure of specificity of protein patterns and studied its relation to the codon usage. The difference in codon usage between the pattern region and the gene region was smaller for patterns with higher specificity.
We proposed the hypothesis that there are segments of genes that affect the codon usage and thus influence protein translation speed, and that these are the regions that code for protein pattern regions.
PMCID: PMC3896713  PMID: 24410898
Codon usage; Sequence analysis; Protein pattern; Pearson’s chi-squared test; Likelihood ratio test
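As an illustration of the comparison described in entry 4, the following minimal sketch computes Pearson's chi-squared statistic for the codon counts of a single amino acid in a pattern region versus the rest of the gene. The function name and the example counts are invented for illustration and are not taken from the paper; the paper's actual test setup may differ.

```python
def chi_squared_statistic(pattern_counts, gene_counts):
    """Pearson's chi-squared statistic for a 2 x k table of codon counts.

    pattern_counts / gene_counts map codon -> count for one amino acid
    (hypothetical input layout, assumed for this sketch).
    Returns (statistic, degrees_of_freedom).
    """
    codons = sorted(set(pattern_counts) | set(gene_counts))
    rows = [[counts.get(c, 0) for c in codons]
            for counts in (pattern_counts, gene_counts)]
    row_totals = [sum(r) for r in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    total = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(rows):
        for j, observed in enumerate(row):
            # Expected count under the null hypothesis of identical codon usage.
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    dof = (len(rows) - 1) * (len(codons) - 1)
    return stat, dof


# Made-up counts for two alanine codons, for illustration only.
stat, dof = chi_squared_statistic({"GCU": 30, "GCC": 10},
                                  {"GCU": 20, "GCC": 40})
```

The statistic would then be compared against a chi-squared distribution with `dof` degrees of freedom to decide whether the two codon usages differ significantly.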
5.  In silico experiment with an antigen-toll like receptor-5 agonist fusion construct for immunogenic application to Helicobacter pylori 
Helicobacter pylori colonizes the gastric mucosa of half of the world's population. Although it is classified as a definitive type I carcinogen by the World Health Organization, there is no effective vaccine against this bacterium. H. pylori evades the host immune response by avoiding toll-like detection, such as detection via toll-like receptor-5 (TLR-5). Thus, a chimeric construct consisting of selected epitopes from virulence factors that is incorporated into a TLR-5 ligand (Pseudomonas flagellin) could result in more potent innate and adaptive immune responses.
Based on the histocompatibility antigens of BALB/c mice, in silico techniques were used to select several fragments from H. pylori virulence factors with a high density of B- and T-cell epitopes.
These segments consist of cytotoxin-associated gene A (residues 162-283), neutrophil activating protein (residues 30-135) and outer inflammatory protein A (residues 155-268). The secondary and tertiary structures of the chimeric constructs were analyzed, and other bioinformatics analyses such as stability, solubility, and antigenicity were performed. The chimeric construct containing antigenic segments of H. pylori proteins was fused with the D3 domain of Pseudomonas flagellin. This recombinant chimeric gene was optimized for expression in Escherichia coli. The in silico results showed that the conserved C- and N-terminal domains of flagellin and the antigenicity of selected fragments were retained.
In silico analysis showed that Pseudomonas flagellin is a suitable platform for incorporation of an antigenic construct from H. pylori. This strategy may be an effective tool for the control of H. pylori and other persistent infections.
PMCID: PMC3722629  PMID: 23901192
Cytotoxin-associated gene A; Helicobacter pylori; multi-epitope vaccine; neutrophil activating protein; outer inflammatory protein A
6.  Features analysis for identification of date and party hubs in protein interaction network of Saccharomyces cerevisiae 
BMC Systems Biology  2010;4:172.
It has been understood that biological networks have modular organizations which are the sources of their observed complexity. Analysis of networks and motifs has shown that two types of hubs, party hubs and date hubs, are responsible for this complexity. Party hubs are local coordinators because of their high co-expression with their partners, whereas date hubs display low co-expression and are assumed to be global connectors. However, there is no agreement on these concepts in the related literature, with different studies reporting their results on different data sets. We investigated whether there is a relation between the biological features of Saccharomyces cerevisiae's proteins and their roles as non-hubs, intermediately connected, party hubs, and date hubs. We propose a classifier that separates these four classes.
We extracted different biological characteristics including amino acid sequences, domain contents, repeated domains, functional categories, biological processes, cellular compartments, disordered regions, and position specific scoring matrix from various sources. Several classifiers are examined and the best feature-sets based on average correct classification rate and correlation coefficients of the results are selected. We show that fusion of five feature-sets including domains, Position Specific Scoring Matrix-400, cellular compartments level one, and composition pairs with two and one gaps provides the best discrimination, with an average correct classification rate of 77%.
We study a variety of known biological feature-sets of the proteins and show that there is a relation between domains, Position Specific Scoring Matrix-400, cellular compartments level one, composition pairs with two and one gaps of Saccharomyces cerevisiae's proteins, and their roles in the protein interaction network as non-hubs, intermediately connected, party hubs and date hubs. This study also confirms the possibility of predicting non-hubs, party hubs and date hubs based on their biological features with acceptable accuracy. If such a hypothesis is correct for other species as well, similar methods can be applied to predict the roles of proteins in those species.
PMCID: PMC3018396  PMID: 21167069
7.  Construction of random perfect phylogeny matrix 
Interest in developing methods appropriate for mapping increasing amounts of genome-wide molecular data is increasing rapidly. There is also an increasing need for methods that are able to efficiently simulate such data.
Patients and methods
In this article, we provide a graph-theory approach to find the necessary and sufficient conditions for the existence of a phylogeny matrix with k nonidentical haplotypes, n single nucleotide polymorphisms (SNPs), and a population size of m for which the minimum allele frequency of each SNP is between two specific numbers a and b.
We introduce an O(max(n^2, nm)) algorithm for the random construction of such a phylogeny matrix. The running time of any algorithm for solving this problem would be Ω(nm).
We have developed software, RAPPER, based on this algorithm, which is available at
PMCID: PMC3170006  PMID: 21918630
perfect phylogeny; minimum allele frequency (MAF); tree; recursive algorithm
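To make the object in entry 7 concrete, here is a minimal sketch that checks whether a binary matrix admits a perfect phylogeny and reports each SNP's allele frequency. It is a checker, not the construction algorithm from the paper and not the RAPPER software. It assumes rows are haplotypes, columns are SNPs, 1 marks the derived allele, and the ancestral haplotype is all zeros; under those assumptions the classical pairwise test applies: the sets of rows carrying a 1 in any two columns must be disjoint or nested. The function names are invented for this example.

```python
from itertools import combinations


def admits_perfect_phylogeny(matrix):
    """Pairwise column test for a rooted (all-zero ancestor) perfect phylogeny.

    matrix is a list of equal-length rows of 0/1 values
    (rows = haplotypes, columns = SNPs; an assumed layout).
    """
    n_cols = len(matrix[0]) if matrix else 0
    # For each SNP, the set of rows that carry the derived allele.
    carriers = [{r for r, row in enumerate(matrix) if row[c] == 1}
                for c in range(n_cols)]
    for a, b in combinations(carriers, 2):
        # Two SNPs conflict if their carrier sets overlap without nesting.
        if a & b and not (a <= b or b <= a):
            return False
    return True


def allele_frequencies(matrix):
    """Fraction of rows carrying the derived allele, one value per column."""
    m = len(matrix)
    return [sum(row[c] for row in matrix) / m for c in range(len(matrix[0]))]


compatible = [[1, 1, 0],
              [1, 0, 0],
              [0, 0, 1],
              [0, 0, 0]]
conflicting = [[1, 0],
               [0, 1],
               [1, 1]]
```

A randomly constructed matrix could be validated with `admits_perfect_phylogeny` and then filtered by checking that every value from `allele_frequencies` lies between the chosen bounds a and b.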
8.  A pairwise residue contact area-based mean force potential for discrimination of native protein structure 
BMC Bioinformatics  2010;11:16.
An energy function that can distinguish a correct protein fold from incorrect ones is very important for protein structure prediction and protein folding. Knowledge-based mean force potentials are certainly the most popular type of interaction function for protein threading. They are derived from statistical analyses of interacting groups in experimentally determined protein structures. These potentials are developed at the atom or the amino acid level. Based on orientation dependent contact area, a new type of knowledge-based mean force potential has been developed.
We developed a new approach to calculate a knowledge-based potential of mean force, using pairwise residue contact area. To test the performance of our approach, we applied it to several decoy sets to measure its ability to discriminate the native structure from decoys. This potential was able to distinguish native structures from the decoys in most cases. Further, the calculated Z-scores were quite high for all protein datasets.
This knowledge-based potential of mean force can be used in protein structure prediction, fold recognition, comparative modelling and molecular recognition. The program is available at
PMCID: PMC2821318  PMID: 20064218
9.  Global haplotype partitioning for maximal associated SNP pairs 
BMC Bioinformatics  2009;10:269.
Global partitioning based on pairwise associations of SNPs has not previously been used to define haplotype blocks within genomes. Here, we define an association index based on LD between SNP pairs. We use Fisher's exact test to assess the statistical significance of the LD estimator. By this test, each SNP pair is characterized as associated, independent, or not-statistically-significant. We set limits on the maximum acceptable proportion of independent pairs within all blocks and search for the partitioning with maximal proportion of associated SNP pairs. Essentially, this model is reduced to a constrained optimization problem, the solution of which is obtained by iterating a dynamic programming algorithm.
In comparison with other methods, our algorithm reports blocks of larger average size. Nevertheless, the haplotype diversity within the blocks is captured by a small number of tagSNPs. Resampling HapMap haplotypes under a block-based model of recombination showed that our algorithm is robust in reproducing the same partitioning for recombinant samples. Our algorithm performed better than previously reported models in a case-control association study aimed at mapping a single locus trait, based on simulation results that were evaluated by a block-based statistical test. Compared to methods of haplotype block partitioning, we performed best on detection of recombination hotspots.
Our proposed method divides chromosomes into the regions within which allelic associations of SNP pairs are maximized. This approach presents a native design for dimension reduction in genome-wide association studies. Our results show that the pairwise allelic association of SNPs can describe various features of genomic variation, in particular recombination hotspots.
PMCID: PMC2749056  PMID: 19712447
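Entry 9 relies on Fisher's exact test to assess SNP pairs. The sketch below computes a two-sided Fisher's exact p-value for a 2 x 2 table, assuming the table holds counts of the four two-locus haplotypes of a SNP pair (a, b in the first row; c, d in the second). How the paper builds its table and maps p-values to the associated / independent / not-significant classes is not stated in the abstract, so that part is left out. The function name is invented for this example.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Small relative tolerance guards against floating-point ties.
    p_value = sum(prob(x) for x in range(lowest, highest + 1)
                  if prob(x) <= p_observed * (1 + 1e-9))
    return min(1.0, p_value)


# Made-up haplotype counts: perfect association in a tiny sample.
p = fisher_exact_two_sided(3, 0, 0, 3)
```

With only six haplotypes, even a perfectly associated pair does not reach a conventional 0.05 significance level, which illustrates why a "not-statistically-significant" class is useful alongside "associated" and "independent".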
10.  Impact of residue accessible surface area on the prediction of protein secondary structures 
BMC Bioinformatics  2008;9:357.
The problem of accurate prediction of protein secondary structure continues to be one of the challenging problems in Bioinformatics. It has been previously suggested that amino acid relative solvent accessibility (RSA) might be an effective factor for increasing the accuracy of protein secondary structure prediction. Previous studies have either used a single constant threshold to classify residues into discrete classes (buried vs. exposed), or used the real-value predicted RSAs in their prediction method.
We studied the effect of applying different RSA threshold types (namely, fixed thresholds vs. residue-dependent thresholds) on a variety of secondary structure prediction methods. Using DSSP-assigned RSA values, we found that improvement in the accuracy of prediction strictly depends on the selected threshold(s). Furthermore, we showed that choosing a single threshold for all amino acids is not the best possible parameter. We therefore used residue-dependent thresholds, and most residues showed improvement in prediction. Next, we considered predicted RSA values, since in the real-world problem the protein sequence is the only available information. We first predicted the RSA classes with the RVP-net program and then used these data in our method. Using this approach, improvement in prediction was also obtained.
The success of applying the RSA information to different secondary structure prediction methods suggests that prediction accuracy can be improved independent of prediction approaches. Thus, solvent accessibility can be considered a rich source of information to help improve these methods.
PMCID: PMC2553345  PMID: 18759992
11.  A tale of two symmetrical tails: Structural and functional characteristics of palindromes in proteins 
BMC Bioinformatics  2008;9:274.
It has been previously shown that palindromic sequences are frequently observed in proteins. However, our knowledge about their evolutionary origin and their possible importance is incomplete.
In this work, we revisit this relatively neglected phenomenon. Several questions are addressed. (1) It is known that there is a large chance of finding a palindrome in low complexity sequences (i.e. sequences with extreme amino acid usage bias). What is the role of sequence complexity in the evolution of palindromic sequences in proteins? (2) Do palindromes coincide with conserved protein sequences? If yes, what are the functions of these conserved segments? (3) In case of conserved palindromes, is it always the case that the whole conserved pattern is also symmetrical? (4) Do palindromic protein sequences form regular secondary structures? (5) Does sequence similarity of the two "sides" of a palindrome imply structural similarity? For the first question, we showed that the complexity of palindromic peptides is significantly lower than that of randomly generated palindromes. Therefore, one can say that palindromes occur frequently in low complexity protein segments, without necessarily having a defined function or forming a special structure. Nevertheless, this does not rule out the possibility of finding palindromes which play some role in protein structure and function. In fact, we found several palindromes that overlap with conserved protein Blocks of different functions. However, in many cases we failed to find any symmetry in the conserved regions of the corresponding Blocks. Furthermore, to answer the last two questions, the structural characteristics of palindromes were studied. It is shown that palindromes may have a great propensity to form α-helical structures. Finally, we demonstrated that the two sides of a palindrome generally do not show significant structural similarities.
We suggest that the puzzling abundance of palindromic sequences in proteins is mainly due to their frequent concurrence with low-complexity protein regions, rather than a global role in the protein function. In addition, palindromic sequences show a relatively high tendency to form helices, which might play an important role in the evolution of proteins that contain palindromes. Moreover, reverse similarity in peptides does not necessarily imply significant structural similarity. This observation rules out the importance of palindromes for forming symmetrical structures. Although palindromes frequently overlap with conserved Blocks, we suggest that palindromes overlap with Blocks only by coincidence, rather than being involved with a certain structural fold or protein domain.
PMCID: PMC2474621  PMID: 18547401
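For readers unfamiliar with the object studied in entry 11, a protein palindrome is a stretch of residues that reads the same in both directions. The following minimal sketch locates such stretches by expanding around every possible center. The minimum length of 4 residues is an assumption made for this example, not a threshold taken from the paper, and the function name is invented.

```python
def find_palindromes(sequence, min_length=4):
    """Return (start, substring) for the longest palindrome at each center.

    Only palindromes with at least min_length residues are reported.
    Centers alternate between a residue (odd lengths) and the gap
    between two residues (even lengths).
    """
    found = []
    n = len(sequence)
    for center in range(2 * n - 1):
        left = center // 2
        right = left + center % 2
        # Grow outwards while the flanking residues match.
        while left >= 0 and right < n and sequence[left] == sequence[right]:
            left -= 1
            right += 1
        start, end = left + 1, right
        if end - start >= min_length:
            found.append((start, sequence[start:end]))
    return found


# Toy peptide, invented for illustration.
hits = find_palindromes("MKAQQAKL")
```

Here `hits` holds the 0-based start position and the residues of each palindrome found, so the two "sides" of a hit can be extracted and compared structurally.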
12.  Impact of RNA structure on the prediction of donor and acceptor splice sites 
BMC Bioinformatics  2006;7:297.
Gene identification in genomic DNA sequences by computational methods has become an important task in bioinformatics, and computational gene prediction tools are now essential components of every genome sequencing project. Prediction of splice sites is a key step of all gene structural prediction algorithms.
We investigated the role of mRNA secondary structures and their information contents for five vertebrate and plant splice site datasets. We selected 900-nucleotide sequences centered at each (real or decoy) donor and acceptor site, and predicted their corresponding RNA structures with the Vienna software. Then, based on whether the nucleotide is in a stem or not, the conventional four-letter nucleotide alphabet was translated into an eight-letter alphabet. Zero-, first- and second-order Markov models were selected as the signal detection methods. It is shown that applying the eight-letter alphabet instead of the four-letter alphabet considerably increases the accuracy of both donor and acceptor site predictions in the case of higher order Markov models.
Our results imply that RNA structure contains important data and future gene prediction programs can take advantage of such information.
PMCID: PMC1526458  PMID: 16772025
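The alphabet translation described in entry 12 can be sketched as follows, assuming the predicted structure is available as a dot-bracket string of the same length as the sequence, where parentheses mark paired (stem) positions and dots mark unpaired ones. Encoding stem nucleotides as lowercase and the rest as uppercase is a choice made for this example; the abstract does not say which eight symbols the authors used. The function name is invented.

```python
def to_eight_letter(sequence, structure):
    """Translate a nucleotide sequence into an eight-letter alphabet.

    Nucleotides at paired positions ('(' or ')') become lowercase;
    nucleotides at unpaired positions stay uppercase.
    """
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    translated = []
    for base, mark in zip(sequence.upper(), structure):
        if base not in "ACGTU":
            raise ValueError(f"unexpected nucleotide: {base!r}")
        translated.append(base.lower() if mark in "()" else base)
    return "".join(translated)


# Toy hairpin, invented for illustration.
encoded = to_eight_letter("GGGAAACCC", "(((...)))")
```

The translated string can then be fed to a Markov model in the same way as the original sequence, with the larger alphabet letting the model condition on structural context.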