Many functional proteins have a symmetric structure. Most of these are multimeric complexes, which are made of non-symmetric monomers arranged in a symmetric manner. However, there are also a large number of proteins that have a symmetric structure in the monomeric state. These internally symmetric proteins are interesting objects from the point of view of their folding, function, and evolution. Most algorithms that detect the internally symmetric proteins depend on finding repeating units of similar structure and do not use the symmetry information.
We describe a new method, called SymD, for detecting symmetric protein structures. The SymD procedure works by comparing the structure to its own copy after the copy is circularly permuted by every possible number of residues. The procedure is relatively insensitive to symmetry-breaking insertions and deletions and amplifies positive signals from symmetry. It finds 70% to 80% of the TIM barrel fold domains in the ASTRAL 40 domain database and 100% of the beta-propellers as symmetric. More globally, 10% to 15% of the proteins in the ASTRAL 40 domain database may be considered symmetric according to this procedure, depending on the precise cutoff value used to measure the degree of perfection of the symmetry. Symmetrical proteins occur in all structural classes and can have a closed, circular structure, a cylindrical barrel-like structure, or an open, helical structure.
SymD is a sensitive procedure for detecting internally symmetric protein structures. Using this procedure, we estimate that 10% to 15% of the known protein domains may be considered symmetric. We also report an initial, overall view of the types of symmetries and symmetric folds that occur in the protein domain structure universe.
The database of Alignable Tight Genomic Clusters (ATGCs) consists of closely related genomes of archaea and bacteria, and is a resource for research into prokaryotic microevolution. Construction of a data set with appropriate characteristics is a major hurdle for this type of study. With the current rate of genome sequencing, it is difficult to follow the progress of the field and to determine which of the available genome sets meet the requirements of a given research project, in particular, with respect to the minimum and maximum levels of similarity between the included genomes. Additionally, extraction of specific content, such as genomic alignments or families of orthologs, from a selected set of genomes is a complicated and time-consuming process. The database addresses these problems by providing an intuitive and efficient web interface to browse precomputed ATGCs, select appropriate ones and access ATGC-derived data such as multiple alignments of orthologous proteins, matrices of pairwise intergenomic distances based on genome-wide analysis of synonymous and nonsynonymous substitution rates, and others. The ATGC database will be regularly updated following new releases of the NCBI RefSeq. The database is hosted by the Genomics Division at Lawrence Berkeley National Laboratory and is publicly available at http://atgc.lbl.gov.
Sinorhizobium meliloti strain 1021, a nitrogen-fixing, root-nodulating bacterial microsymbiont of alfalfa, has a 3.5 Mbp circular chromosome and two megaplasmids including 1.3 Mbp pSymA carrying nonessential ‘accessory’ genes for nitrogen fixation (nif), nodulation and host specificity (nod). A related bacterium, psyllid-vectored ‘Ca. Liberibacter asiaticus,’ is an obligate phytopathogen with a reduced genome that was previously analyzed for genes orthologous to genes on the S. meliloti circular chromosome. In general, proteins encoded by pSymA genes are more similar in sequence alignment to those encoded by S. meliloti chromosomal orthologs than to orthologous proteins encoded by genes carried on the ‘Ca. Liberibacter asiaticus’ genome. Only two ‘Ca. Liberibacter asiaticus’ proteins were identified as having orthologous proteins encoded on pSymA but not also encoded on the chromosome of S. meliloti. These two orthologous gene pairs encode a Na+/K+ antiporter (shared with intracellular pathogens of the family Bartonellaceae) and a Co++, Zn++ and Cd++ cation efflux protein that is shared with the phytopathogen Agrobacterium. Another shared protein, a redox-regulated K+ efflux pump, may regulate cytoplasmic pH and homeostasis. The pSymA and ‘Ca. Liberibacter asiaticus’ orthologs of the latter protein are more similar to each other in amino acid alignment than the pSymA-encoded protein is to its S. meliloti chromosomal homolog. About 182 pSymA-encoded proteins have sequence similarity (≤E-10) with ‘Ca. Liberibacter asiaticus’ proteins, often present as multiple orthologs of single ‘Ca. Liberibacter asiaticus’ proteins. These proteins are involved with amino acid uptake, cell surface structure, chaperonins, electron transport, export of bioactive molecules, cellular homeostasis, regulation of gene expression, signal transduction and synthesis of amino acids and metabolic cofactors.
The presence of multiple orthologs defies mutational analysis and is consistent with the hypothesis that these proteins may be of particular importance in host/microbe interaction and their duplication likely facilitates their ongoing evolution.
The identification of orthologous genes forms the basis for most comparative genomics studies. Existing approaches either lack functional annotation of the identified orthologous groups, hampering the interpretation of subsequent results, or are manually annotated and thus lag behind the rapid sequencing of new genomes. Here we present the eggNOG database (‘evolutionary genealogy of genes: Non-supervised Orthologous Groups’), which contains orthologous groups constructed from Smith–Waterman alignments through identification of reciprocal best matches and triangular linkage clustering. Applying this procedure to 312 bacterial, 26 archaeal and 35 eukaryotic genomes yielded 43 582 coarse-grained orthologous groups of which 9724 are extended versions of those from the original COG/KOG database. We also constructed more fine-grained groups for selected subsets of organisms, such as the 19 914 mammalian orthologous groups. We automatically annotated our non-supervised orthologous groups with functional descriptions, which were derived by identifying common denominators for the genes based on their individual textual descriptions, annotated functional categories, and predicted protein domains. The orthologous groups in eggNOG contain 1 241 751 genes and provide at least a broad functional description for 77% of them. Users can query the resource for individual genes via a web interface or download the complete set of orthologous groups at http://eggnog.embl.de.
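The two construction steps named above, reciprocal best matches and triangular linkage clustering, can be illustrated with a minimal sketch. It is not the eggNOG pipeline: the gene names, the genome labels, the similarity scores and the rule that triangles are merged when they share an edge are all assumptions made for illustration.

```python
def best_hits(scores, genome_of):
    """For each gene, find its highest-scoring match in every other genome."""
    best = {}  # (gene, other_genome) -> (score, matched_gene)
    for (a, b), s in scores.items():
        for query, target in ((a, b), (b, a)):
            if genome_of[query] == genome_of[target]:
                continue  # ignore within-genome matches
            key = (query, genome_of[target])
            if key not in best or s > best[key][0]:
                best[key] = (s, target)
    return {key: target for key, (s, target) in best.items()}


def reciprocal_best_hits(scores, genome_of):
    """Keep gene pairs that are each other's best match across two genomes."""
    best = best_hits(scores, genome_of)
    pairs = set()
    for (query, _genome), target in best.items():
        if best.get((target, genome_of[query])) == query:
            pairs.add(frozenset((query, target)))
    return pairs


def triangle_clusters(pairs):
    """Group genes by merging triangles of reciprocal best hits that share an edge."""
    neighbours = {}
    for pair in pairs:
        a, b = tuple(pair)
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    parent = {}  # union-find over edges, so triangles sharing an edge merge

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for pair in pairs:
        a, b = tuple(pair)
        for c in neighbours[a] & neighbours[b]:  # c closes a triangle a-b-c
            parent[find(pair)] = find(frozenset((a, c)))
            parent[find(pair)] = find(frozenset((b, c)))

    groups = {}
    for edge in list(parent):
        groups.setdefault(find(edge), set()).update(edge)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical toy data: three genomes A, B, C; b2 is a paralog in genome B.
genome_of = {"a1": "A", "b1": "B", "b2": "B", "c1": "C"}
scores = {("a1", "b1"): 100, ("a1", "c1"): 90, ("b1", "c1"): 95,
          ("a1", "b2"): 60, ("b2", "c1"): 50}
groups = triangle_clusters(reciprocal_best_hits(scores, genome_of))
print(groups)
```

In this toy input the paralog b2 is never a reciprocal best match, so it stays outside the single group formed by a1, b1 and c1.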
In Rhizobium meliloti 2011, nodulation genes (nod) required to specifically nodulate alfalfa are located on a pSym megaplasmid. Nod- derivatives carrying large pSym deletions were isolated. By complementation of these strains with in vivo- and in vitro-constructed episomes containing pSym sequences and introduction of these episomes into Agrobacterium tumefaciens, we show (i) that from a region of pSym of about 360 kilobases, genes required for specific alfalfa nodulation are clustered in a DNA fragment of less than 30 kilobases and (ii) that a nod region located between nifHDK and the common nod genes is absolutely required for alfalfa nodulation and controls the specificity of root hair curling and nodule organogenesis initiation.
In microarray data analysis, hierarchical clustering (HC) is often used to group samples or genes according to their gene expression profiles to study their associations. In a typical HC, nested clustering structures can be quickly identified in a tree. The relationship between objects is lost, however, because clusters rather than individual objects are compared. This results in a tree that is hard to interpret.
This study proposes an ordering method, HC-SYM, which minimizes bilateral symmetric distance of two adjacent clusters in a tree so that similar objects in the clusters are located in the cluster boundaries. The performance of HC-SYM was evaluated by both supervised and unsupervised approaches and compared favourably with other ordering methods.
The intuitive relationship between objects and flexibility of the HC-SYM method can be very helpful in the exploratory analysis of not only microarray data but also similar high-dimensional data.
Rationale: T-bet (TBX21 or T-box 21) is a critical regulator of T-helper 1 lineage commitment and IFN-γ production. Knockout mice lacking T-bet develop airway hyperresponsiveness (AHR) to methacholine, peribronchial eosinophilic and lymphocytic inflammation, and increased type III collagen deposition below the bronchial epithelium basement membrane, reminiscent of both acute and chronic asthma histopathology. Little is known regarding the role of genetic variation surrounding T-bet in the development of human AHR.
Objectives: To assess the relationship between T-bet polymorphisms and asthma-related phenotypes using family-based association.
Methods: Single nucleotide polymorphism discovery was performed by resequencing the T-bet genomic locus in 30 individuals (including 22 patients with asthma). Sixteen variants were genotyped in 580 nuclear families ascertained through offspring with asthma from the Childhood Asthma Management Program clinical trial. Haplotype patterns were determined from this genotype data. Family-based tests of association were performed with asthma, AHR, lung function, total serum immunoglobulin E, and blood eosinophil levels.
Main Results: We identified 24 variants. Evidence of association was observed between c.−7947 and asthma in white families using either additive (p = 0.02) or dominant models (p = 0.006). c.−7947 and three other variants were also associated with AHR (log-methacholine PC20, p = 0.02–0.04). Haplotype analysis suggested that an AHR locus is in linkage disequilibrium with variants in the 3′UTR. Evidence of association of AHR with c.−7947, but not with other 3′UTR SNPs, was replicated in an independent cohort of adult males with AHR.
Conclusions: These data suggest that T-bet variation contributes to airway responsiveness in asthma.
immunoglobulin E; single nucleotide polymorphism; T-box; TBX21
Correct orthology assignment is a critical prerequisite of numerous comparative genomics procedures, such as function prediction, construction of phylogenetic species trees and genome rearrangement analysis. We present an algorithm for the detection of non-orthologs that arise by mistake in current orthology classification methods based on genome-specific best hits, such as the COGs database. The algorithm works with pairwise distance estimates, rather than computationally expensive and error-prone tree-building methods. The accuracy of the algorithm is evaluated through verification of the distribution of predicted cases, case-by-case phylogenetic analysis and comparisons with predictions from other projects using independent methods. Our results show that a very significant fraction of the COG groups include non-orthologs: using conservative parameters, the algorithm detects non-orthology in a third of all COG groups. Consequently, sequence analysis sensitive to correct orthology assignments will greatly benefit from these findings.
Accurate identification of orthologs is crucial for evolutionary studies and
for functional annotation. Several algorithms have been developed for
ortholog delineation, but so far, manually curated genome-scale biological
databases of orthologous genes for algorithm evaluation have been lacking.
We evaluated four popular ortholog prediction algorithms
(MultiParanoid; OrthoMCL; RBH: Reciprocal
Best Hit; RSD: Reciprocal Smallest Distance; the last two extended into
clustering algorithms cRBH and cRSD, respectively, so that
they can predict orthologs across multiple taxa) against a set of 2,723
groups of high-quality curated orthologs from 6 Saccharomycete yeasts in the
Yeast Gene Order Browser.
Examination of sensitivity [TP/(TP+FN)],
specificity [TN/(TN+FP)], and accuracy
[(TP+TN)/(TP+TN+FP+FN)] across a broad
parameter range showed that cRBH was the most accurate and specific
algorithm, whereas OrthoMCL was the most sensitive. Evaluation of
the algorithms across a varying number of species showed that cRBH
had the highest accuracy and lowest false discovery
rate [FP/(FP+TP)], followed by cRSD. Of the
six species in our set, three descended from an ancestor that underwent
whole genome duplication. Subsequent differential duplicate loss events in
the three descendants resulted in distinct classes of gene loss patterns,
including cases where the genes retained in the three descendants are
paralogs, constituting ‘traps’ for ortholog prediction
algorithms. We found that the false discovery
rate of all algorithms dramatically increased in these traps.
These results suggest that simple algorithms, like cRBH, may be
better ortholog predictors than more complex ones (e.g., OrthoMCL
and MultiParanoid) for evolutionary and functional
genomics studies where the objective is the accurate inference of
single-copy orthologs (e.g., molecular phylogenetics), but that all
algorithms fail to accurately predict orthologs when paralogy is
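The four measures quoted in brackets in this abstract follow directly from the confusion-matrix counts. The sketch below computes them as written; the rate FP/(FP+TP) is the quantity usually called the false discovery rate. The function name and the example counts are hypothetical and are not taken from the study.

```python
import math


def ratio(numerator, denominator):
    """Return numerator/denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else math.nan


def confusion_metrics(tp, fp, tn, fn):
    """Compute the evaluation measures from true/false positive/negative counts."""
    return {
        "sensitivity": ratio(tp, tp + fn),               # TP/(TP+FN)
        "specificity": ratio(tn, tn + fp),               # TN/(TN+FP)
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),   # (TP+TN)/(TP+TN+FP+FN)
        "false_discovery_rate": ratio(fp, fp + tp),      # FP/(FP+TP)
    }


# Hypothetical counts for one algorithm at one parameter setting.
metrics = confusion_metrics(tp=80, fp=20, tn=90, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Returning NaN for an empty denominator is a design choice of this sketch, so that a parameter setting with no predictions can be reported without raising an error.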
T-bet is a critical transcription factor for T helper-1 (Th1) cell differentiation. To study the regulation and functions of T-bet, we developed a T-bet-ZsGreen reporter mouse strain. We determined that interleukin-12 (IL-12) and interferon-γ (IFN-γ) were redundant in inducing T-bet in mice infected with Toxoplasma gondii and that T-bet did not contribute to its own expression when induced by IL-12 and IFN-γ. By contrast, T-bet and the transcription factor Stat4 were critical for IFN-γ production whereas IFN-γ signaling was dispensable for inducing IFN-γ. Loss of T-bet resulted in activation of an endogenous program driving Th2 cell differentiation in cells expressing T-bet-ZsGreen. Genome-wide analyses indicated that T-bet directly induced many Th1 cell-related genes but indirectly suppressed Th2 cell-related genes. Our study revealed redundancy and synergy among several Th1 cell-inducing pathways in regulating the expression of T-bet and IFN-γ, and a critical role of T-bet in suppressing an endogenous Th2 cell-associated program.
T-bet acts as a functional repressor in association with Bcl-6 to antagonize SOCS1, SOCS3, TCF-1, and late-stage IFN-γ to regulate Th1 development.
The T-box transcription factor T-bet is important for the differentiation of naive CD4+ T helper cells (Th cells) into the Th1 phenotype. Much is known about T-bet’s role as a transcriptional activator, but less is known about the mechanisms by which T-bet functionally represses alternative Th cell genetic programs. In this study, we first identify Socs1, Socs3, and Tcf7 (TCF-1) as gene targets that are negatively regulated by T-bet. Significantly, T-bet’s role in the repression of these genes is through a direct interaction with their promoters. Consistent with this, we identified two T-bet DNA-binding elements in the Socs1 promoter that are functionally used to down-regulate transcription in primary Th1 cells. Importantly, T-bet’s novel role in transcriptional repression is because of its ability to physically associate with, and functionally recruit, the transcriptional repressor Bcl-6 to a subset of promoters. Furthermore, T-bet functionally recruits Bcl-6 to the Ifng locus in late stages of Th1 differentiation to repress its activity, possibly to prevent the overproduction of IFN-γ, which could result in autoimmunity. Collectively, these data establish a novel mechanism for T-bet–mediated gene repression in which two lineage-defining transcription factors, one a classical activator and one a repressor, collaborate to promote and properly regulate Th1 development.
Determining orthology relations among genes across multiple genomes is an important problem in the post-genomic era. Identifying orthologous genes can not only help predict functional annotations for newly sequenced or poorly characterized genomes, but can also help predict new protein–protein interactions. Unfortunately, determining orthology relation through computational methods is not straightforward due to the presence of paralogs. Traditional approaches have relied on pairwise sequence comparisons to construct graphs, which were then partitioned into putative clusters of orthologous groups. These methods do not attempt to preserve the non-transitivity and hierarchic nature of the orthology relation.
We propose a new method, COCO-CL, for hierarchical clustering of homology relations and identification of orthologous groups of genes. Unlike previous approaches, which are based on pairwise sequence comparisons, our method explores the correlation of evolutionary histories of individual genes in a more global context. COCO-CL can be used as a semi-independent method to delineate the orthology/paralogy relation for a refined set of homologous proteins obtained using a less-conservative clustering approach, or as a refiner that removes putative out-paralogs from clusters computed using a more inclusive approach. We analyze our clustering results manually, with support from literature and functional annotations. Since our orthology determination procedure does not employ a species tree to infer duplication events, it can be used in situations when the species tree is unknown or uncertain.
One type of competitive interaction among rhizobia is that between nonnodulating and nodulating strains of Rhizobium leguminosarum on primitive pea genotypes. Pisum sativum cv. Afghanistan nodulates effectively with R. leguminosarum TOM, and this can be blocked in mixed inoculations by R. leguminosarum PF2, which does not nodulate this cultivar. We termed this PF2 phenotype Cnb+, for competitive nodulation blocking. Strain PF2 contains three large plasmids including a 250-kilobase-pair symbiotic (Sym) plasmid. Transfer of this plasmid, pSymPF2, to nonblocking rhizobia conferred the Cnb+ phenotype on recipients in mixed inoculations on cultivar Afghanistan with TOM. A library of the PF2 genome constructed in the vector pMMB33 was used to isolate two cosmid clones which hybridize to pSymPF2. These cosmids, pDD50 and pDD58, overlapped to the extent of 23 kilobase pairs and conferred a Cnb+ phenotype on recipient Cnb- rhizobia, as did pSD1, a subclone from the common region.
OMA is a project that aims to identify orthologs within publicly available, complete genomes. With 657 genomes analyzed to date, OMA is one of the largest projects of its kind.
The algorithm of OMA improves upon the standard bidirectional best-hit approach in several respects: it uses evolutionary distances instead of scores, considers distance inference uncertainty, includes many-to-many orthologous relations, and accounts for differential gene losses. Herein, we describe in detail the algorithm for inference of orthology and provide the rationale for parameter selection through multiple tests.
OMA contains several novel improvements for orthology inference and provides a unique dataset of large-scale orthology assignments.
We uncover the global organization of clustering in real complex networks. To this end, we ask whether triangles in real networks organize as in maximally random graphs with given degree and clustering distributions, or as in maximally ordered graph models where triangles are forced into modules. The answer comes by way of exploring m-core landscapes, where the m-core is defined, akin to the k-core, as the maximal subgraph with edges participating in at least m triangles. This property defines a set of nested subgraphs that, contrary to k-cores, is able to distinguish between hierarchical and modular architectures. We find that the clustering organization in real networks is neither completely random nor ordered although, surprisingly, it is more random than modular. This supports the idea that the structure of real networks may in fact be the outcome of self-organized processes based on local optimization rules, in contrast to global optimization principles.
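The m-core definition given above (the maximal subgraph in which every edge participates in at least m triangles) suggests a simple pruning procedure, sketched here under the assumption of an undirected simple graph given as an edge list. The function name `m_core` and the toy graph are illustrative only; the iterative edge removal is one straightforward way to reach the maximal subgraph and is not necessarily the implementation the authors used.

```python
def m_core(edges, m):
    """Return the edges of the m-core: repeatedly drop edges in fewer than m triangles."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    changed = True
    while changed:
        changed = False
        for u in list(adj):
            for v in list(adj[u]):
                # Triangles through edge (u, v) = common neighbours of u and v.
                if v in adj[u] and len(adj[u] & adj[v]) < m:
                    adj[u].discard(v)
                    adj[v].discard(u)
                    changed = True  # a removal can break other triangles
    return {frozenset((u, v)) for u in adj for v in adj[u]}


# Toy graph: a 4-clique on nodes 1-4, a triangle 4-5-6, and a dangling edge 6-7.
toy_edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4),
             (4, 5), (5, 6), (4, 6), (6, 7)]
for m in (1, 2, 3):
    print(m, len(m_core(toy_edges, m)))
```

On the toy graph the cores are nested: the 1-core keeps the clique and the triangle, the 2-core keeps only the clique, and the 3-core is empty.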
Best Evidence Topic reports (BETs) summarise the evidence pertaining to particular clinical questions. They are not systematic reviews, but rather contain the best (highest level) evidence that can be practically obtained by busy practising clinicians. The search strategies used to find the best evidence are reported in detail in order to allow clinicians to update searches whenever necessary. Each BET is based on a clinical scenario and ends with a clinical bottom line, which indicates, in the light of the evidence found, what the reporting clinician would do if faced with the same scenario again. The BETs published below were first reported at the Critical Appraisal Journal Club at the Manchester Royal Infirmary1 or placed on the BestBETs website. Each BET has been constructed in the four stages that have been described elsewhere.2 The BETs shown here together with those published previously and those currently under construction can be seen at http://www.bestbets.org.3 Three BETs are included in this issue of the journal.
Gammahydroxybutyrate overdose and physostigmine
Terlipressin or sclerotherapy for acute variceal bleeding?
Full blood count and reticulocyte count in painful sickle crisis
1 Carley SD, Mackway‐Jones K, Jones A, et al. Moving towards evidence based emergency medicine: use of a structured critical appraisal journal club. J Accid Emerg Med 1998;15:220–2.
2 Mackway‐Jones K, Carley SD, Morton RJ, et al. The best evidence topic report: a modified CAT for summarising the available evidence in emergency medicine. J Accid Emerg Med 1998;15:222–6.
3 Mackway‐Jones K, Carley SD. bestbets.org: odds on favourite for evidence in emergency medicine reaches the worldwide web. J Accid Emerg Med 2000;17:235–6.
Best evidence topic reports (BETs) summarise the evidence pertaining to particular clinical questions. They are not systematic reviews, but rather contain the best (highest level) evidence that can be practically obtained by busy practising clinicians. The search strategies used to find the best evidence are reported in detail to allow clinicians to update searches whenever necessary. Each BET is based on a clinical scenario and ends with a clinical bottom line that indicates, in light of the evidence found, what the reporting clinician would do if faced with the same scenario again. The BETs published below were first reported at the Critical Appraisal Journal Club at the Manchester Royal Infirmary1 or placed on the BestBETs website. Each BET has been constructed in the four stages that have been described elsewhere.2 The BETs shown here, those published previously and those currently under construction, can be seen at http://www.bestbets.org.3 Three BETs are included in this issue of the journal.
Accuracy of emergency department ultrasound in detecting abdominal aortic aneurysms
Use of aspirin in acute stroke
Myringotomy in traumatic haemotympanum
1 Carley SD, Mackway‐Jones K, Jones A, et al. Moving towards evidence based emergency medicine: use of a structured critical appraisal journal club. J Accid Emerg Med 1998;15:220–2.
2 Mackway‐Jones K, Carley SD, Morton RJ, et al. The best evidence topic report: a modified CAT for summarising the available evidence in emergency medicine. J Accid Emerg Med 1998;15:222–6.
3 Mackway‐Jones K, Carley SD. bestbets.org: odds on favourite for evidence in emergency medicine reaches the worldwide web. J Accid Emerg Med 2000;17:235–6.
Best Evidence Topic reports (BETs) summarise the evidence pertaining to particular clinical questions. They are not systematic reviews, but rather contain the best (highest level) evidence that can be practically obtained by busy practising clinicians. The search strategies used to find the best evidence are reported in detail in order to allow clinicians to update searches whenever necessary. Each BET is based on a clinical scenario and ends with a clinical bottom line which indicates, in the light of the evidence found, what the reporting clinician would do if faced with the same scenario again. The BETs published below were first reported at the Critical Appraisal Journal Club at the Manchester Royal Infirmary1 or placed on the BestBETs website. Each BET has been constructed in the four stages that have been described elsewhere.2 The BETs shown here together with those published previously and those currently under construction can be seen at http://www.bestbets.org.3 Three BETs are included in this issue of the journal.
Central venous catheterisation: internal jugular or subclavian approach?
Rigors in febrile children may be associated with a higher incidence of serious bacterial infection
Treatment of jellyfish stings in UK coastal waters: vinegar or sodium bicarbonate?
1. Carley SD, Mackway‐Jones K, Jones A, et al. Moving towards evidence based emergency medicine: use of a structured critical appraisal journal club. J Accid Emerg Med 1998;15:220–2.
2. Mackway‐Jones K, Carley SD, Morton RJ, et al. The best evidence topic report: a modified CAT for summarising the available evidence in emergency medicine. J Accid Emerg Med 1998;15:222–6.
3. Mackway‐Jones K, Carley SD. bestbets.org: Odds on favourite for evidence in emergency medicine reaches the worldwide web. J Accid Emerg Med 2000;17:235–6.
In this paper, we propose a new registration method for prone and supine computed tomographic colonography (CTC) scans using graph matching. We formulate 3D colon registration as a graph matching problem and propose a new graph matching algorithm based on mean field theory. The proposed algorithm solves the matching problem iteratively. In each step, we use mean field theory to find the matched pair of nodes with the highest probability. During iterative optimization, one-to-one matching constraints are added to the system step by step. Prominent matching pairs found in previous iterations are used to guide subsequent mean field calculations. The proposed method had the best performance and the smallest standard deviation compared with two baseline algorithms: normalized distance along the colon centerline (NDACC) with manual colon centerline correction (p=0.17) and spectral matching (p<1e-5). A major advantage of the proposed method is that it is fully automatic and does not require defining a colon centerline for registration. For the NDACC method, by contrast, user interaction is almost always needed to identify the colon centerlines.
Computed Tomographic Colonography; graph matching; mean field theory; colon registration
The identification of orthologous relationships forms the basis for most comparative genomics studies. Here, we present the second version of the eggNOG database, which contains orthologous groups (OGs) constructed through identification of reciprocal best BLAST matches and triangular linkage clustering. We applied this procedure to 630 complete genomes (529 bacteria, 46 archaea and 55 eukaryotes), which is a 2-fold increase relative to the previous version. The pipeline yielded 224 847 OGs, including 9724 extended versions of the original COG and KOG. We computed OGs for different levels of the tree of life; in addition to the species groups included in our first release (i.e. fungi, metazoa, insects, vertebrates and mammals), we have now constructed OGs for archaea, fishes, rodents and primates. We automatically annotate the non-supervised orthologous groups (NOGs) with functional descriptions, protein domains, and functional categories as defined initially for the COG/KOG database. In-depth analysis is facilitated by precomputed high-quality multiple sequence alignments and maximum-likelihood trees for each of the available OGs. Altogether, eggNOG covers 2 242 035 proteins (built from 2 590 259 proteins) and provides a broad functional description for at least 1 966 709 (88%) of them. Users can access the complete set of orthologous groups via a web interface at: http://eggnog.embl.de.
The paper investigates parameterized approximate message-passing schemes that are based on bounded inference and are inspired by Pearl's belief propagation algorithm (BP). We start with the bounded inference mini-clustering algorithm and then move to the iterative scheme called Iterative Join-Graph Propagation (IJGP), which combines both iteration and bounded inference. Algorithm IJGP belongs to the class of Generalized Belief Propagation algorithms, a framework that allowed connections with approximate algorithms from statistical physics, and it is shown empirically to surpass the performance of mini-clustering and belief propagation, as well as a number of other state-of-the-art algorithms, on several classes of networks. We also provide insight into the accuracy of iterative BP and IJGP by relating these algorithms to well-known classes of constraint propagation schemes.
We previously reported two graph algorithms for analysis of genomic information: a graph comparison algorithm to detect locally similar regions called correlated clusters and an algorithm to find a graph feature called P-quasi complete linkage. Based on these algorithms we have developed an automatic procedure to detect conserved gene clusters and align orthologous gene orders in multiple genomes. In the first step, the graph comparison is applied to pairwise genome comparisons, where the genome is considered as a one-dimensionally connected graph with genes as its nodes, and correlated clusters of genes that share sequence similarities are identified. In the next step, the P-quasi complete linkage analysis is applied to group related clusters, and conserved gene clusters in multiple genomes are identified. In the last step, orthologous relations of genes are established within each conserved cluster. We analyzed 17 completely sequenced microbial genomes and obtained 2313 clusters when the completeness parameter P was 40%. About one quarter contained at least two genes that appeared in the metabolic and regulatory pathways in the KEGG database. This collection of conserved gene clusters is used to refine and augment ortholog group tables in KEGG and also to define ortholog identifiers as an extension of EC numbers.
Although dozens of algorithms and tools have been developed to find a set of cis-regulatory binding sites called a motif in a set of intergenic sequences using various approaches, most of these tools focus on identifying binding sites that are significantly different from their background sequences. However, some motifs may have a similar nucleotide distribution to that of their background sequences. Therefore, such binding sites can be missed by these tools.
Here, we present a graph-based polynomial-time algorithm, MotifClick, for the prediction of cis-regulatory binding sites, in particular, those that have a similar nucleotide distribution to that of their background sequences. To find binding sites with length k, we construct a graph using some 2(k-1)-mers in the input sequences as the vertices, and connect two vertices by an edge if the maximum number of matches of the local gapless alignments between the two 2(k-1)-mers is greater than a cutoff value. We identify a motif as a set of similar k-mers from a merged group of maximum cliques associated with some vertices.
When evaluated on both synthetic and real datasets of prokaryotes and eukaryotes, MotifClick outperforms existing leading motif-finding tools for prediction accuracy and balancing the prediction sensitivity and specificity in general. In particular, when the distribution of nucleotides of binding sites is similar to that of their background sequences, MotifClick is more likely to identify the binding sites than the other tools.
The concept of orthology is key to decoding evolutionary relationships among genes across different species using comparative genomics. QuartetS is a recently reported algorithm for large-scale orthology detection. Based on the well-established evolutionary principle that gene duplication events discriminate paralogous from orthologous genes, QuartetS has been shown to improve orthology detection accuracy while maintaining computational efficiency.
QuartetS-DB is a new orthology database constructed using the QuartetS algorithm. The database provides orthology predictions among 1621 complete genomes (1365 bacterial, 92 archaeal, and 164 eukaryotic), covering more than seven million proteins and four million pairwise orthologs. It is a major source of orthologous groups, containing more than 300,000 groups of orthologous proteins and 236,000 corresponding gene trees. The database also provides over 500,000 groups of inparalogs. In addition to its size, a distinguishing feature of QuartetS-DB is the ability to allow users to select a cutoff value that modulates the balance between prediction accuracy and coverage of the retrieved pairwise orthologs. The database is accessible at https://applications.bioanalysis.org/quartetsdb.
QuartetS-DB is one of the largest orthology resources available to date. Because its orthology predictions are underpinned by evolutionary evidence obtained from sequenced genomes, we expect its accuracy to continue to increase in future releases as the genomes of additional species are sequenced.
Orthologs; Orthology detection; Orthology database
DNA microarray technology allows for the measurement of genome-wide expression patterns. Within the resultant mass of data lies the problem of analyzing and presenting information on this genomic scale, and a first step towards the rapid and comprehensive interpretation of these data is gene clustering with respect to the expression patterns. Classifying genes into clusters can lead to interesting biological insights. In this study, we describe an iterative clustering approach to uncover biologically coherent structures from DNA microarray data based on a novel clustering algorithm, EP_GOS_Clust.
We apply our proposed iterative algorithm to three sets of experimental DNA microarray data from experiments with the yeast Saccharomyces cerevisiae and show that the proposed iterative approach improves biological coherence. Comparison with other clustering techniques suggests that our iterative algorithm provides superior performance with regard to biological coherence. An important consequence of our approach is that an increasing proportion of genes find membership in clusters of high biological coherence and that the average cluster specificity improves.
The results from these clustering experiments provide a robust basis for extracting motifs and trans-acting factors that determine particular patterns of expression. In addition, the biological coherence of the clusters is iteratively assessed independently of the clustering. Thus, this method will not be severely impacted by functional annotations that are missing, inaccurate, or sparse.