Nucleic Acids Res. 2010 July 1; 38(Web Server issue): W268–W274.
Published online 2010 April 30. doi:  10.1093/nar/gkq330
PMCID: PMC2896078

Phyloscan: locating transcription-regulating binding sites in mixed aligned and unaligned sequence data

Abstract

The transcription of a gene from its DNA template into an mRNA molecule is the first, and most heavily regulated, step in gene expression. Especially in bacteria, regulation is typically achieved via the binding of a transcription factor (protein) or small RNA molecule to the chromosomal region upstream of a regulated gene. The protein or RNA molecule recognizes a short, approximately conserved sequence within a gene's promoter region and, by binding to it, either enhances or represses expression of the nearby gene. Since the sought-for motif (pattern) is short and accommodating to variation, computational approaches that scan for binding sites have trouble distinguishing functional sites from look-alikes. Many computational approaches are unable to find the majority of experimentally verified binding sites without also finding many false positives. Phyloscan overcomes this difficulty by exploiting two key features of functional binding sites: (i) these sites are typically more conserved evolutionarily than are non-functional DNA sequences; and (ii) these sites often occur two or more times in the promoter region of a regulated gene. The website is free and open to all users, and there is no login requirement. Address: (http://bayesweb.wadsworth.org/phyloscan/).

INTRODUCTION

With the sequencing of many genomes, we may immediately start asking questions about the genes that are being found. The gene sequences encode proteins and other products, but what do the gene products do and what determines the quantity of expression of a gene product? The answer to the latter question is key to the study of normal and pathological cell function and differentiation; for instance, how does a muscle cell know not to produce proteins used exclusively in skin cells, and how might the regulation go awry?

There are many steps in the creation of a gene product from a gene, starting with transcription, the reading of the DNA template to create an RNA message to be used in subsequent steps. Especially in bacteria, gene regulation is typically achieved via the binding of a transcription factor (protein) or small RNA molecule to the chromosomal region upstream of a regulated gene. The protein or RNA molecule recognizes a sequence within such a promoter region and, by binding to it, either enhances or represses expression of the nearby gene.

With a collection of experimentally verified binding sites for a regulating protein or RNA in hand, or with a motif (pattern) derived therefrom, it is natural to seek additional genes that are regulated by the same molecule. This computational process is called scanning (1–16), and it often includes multi-species data and mathematical models for exploiting phylogenetic/evolutionary relationships (17–20). However, especially because the motif is typically short (6–30 nt in length) and tolerant of variation, the determination as to whether a proposed site is a functional binding site can be difficult. Frequently, attempts to hold the level of false positives low also cause the tools to overlook too many experimentally verified binding sites. Among the purely computational approaches, the phylogeny-based tools have some advantage, because they can exploit conservation across species as suggestive of a functional binding site. Phyloscan (21) does particularly well, because it handles phylogenetic relationships whether or not a (multiple) sequence alignment is available, and also because it is able to combine the existence of multiple weak binding sites [a common occurrence (22)] into a statistically strong statement that binding does occur somewhere in a promoter region. These traits are advantageous for analyses of large multi-genomic data sets.

The Phyloscan algorithmics paper (21) describes how we use the Neuwald–Green technique (23) to statistically combine evidence from multiple sites within a promoter region, and how we use the Bailey–Gribskov technique (24) to statistically combine evidence across unaligned orthologous sequences. The algorithmics paper also describes the quantitative evaluations of Phyloscan that we have performed, and includes several measures of predictive performance, such as sensitivity, specificity and positive predictive value, as estimated from real and simulated data. Some of the earlier data are reproduced in Figure 1. Note that the ‘1 clade / 1 site’ functionality is similar to that of MONKEY (17), although MONKEY employs techniques to optimize the placement of sequence alignment gaps.

Figure 1.
Shown are receiver operating characteristic (ROC) curves for Phyloscan as applied to promoter regions containing a pair of full-strength Escherichia coli Crp binding sites, a pair of 1/2-strength sites, and a pair of 1/3-strength sites. The simulated ...

With the new web server, the underlying algorithmics remain unchanged. The new web server permits the user to supply the data to be scanned, where the older server scanned only a specific set of gamma-proteobacterial species data. The new server allows several data formats instead of requiring the use of the FASTA format. Additionally, the new web server provides a tutorial and expanded ‘help’ information.

The Phyloscan runtime is O(wn), where w is the width of a binding site and n is the number of nucleotides in the sequences to be scanned. The constant of proportionality is ~2 μs; Phyloscan scans 2 million nucleotides with a motif model of width 16 in 60 s.
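As a rough check of the runtime figure, the following sketch multiplies the motif width by the number of nucleotides and by a per-unit constant. The function name and the default of 2 microseconds are illustrative assumptions (the constant is only the approximate value implied by the quoted 60 s for 2 million nucleotides at width 16); this is not part of Phyloscan itself.

```python
def estimated_runtime_seconds(motif_width, num_nucleotides, seconds_per_unit=2e-6):
    """Rough runtime estimate: constant * motif width * nucleotides scanned.

    seconds_per_unit is an assumed constant of proportionality (about 2 microseconds).
    """
    if motif_width <= 0 or num_nucleotides < 0:
        raise ValueError("motif_width must be positive and num_nucleotides non-negative")
    return seconds_per_unit * motif_width * num_nucleotides


# Width-16 motif over 2 million nucleotides: roughly a minute.
print(estimated_runtime_seconds(16, 2_000_000))
```

The estimate comes out at about 64 s, consistent with the quoted 60 s given that the constant is only approximate.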

THE INPUTS

For input, Phyloscan requests the information itemized below. Defaults and/or examples are available for each item.

E-mail address

The user can optionally supply an e-mail address. If it is supplied, the user will receive notification when the submitted Phyloscan job has completed. Whether or not an e-mail address is supplied, upon job submission the user will be provided a link to where the results will become available. The user can go to that web page immediately; the page refreshes every 10 s until the results become available.

Phylogenetic tree

Phyloscan exploits phylogenetic relationships among sequences that are (multiply) aligned, by employing nucleotide substitution models: non-functional nucleotides are modeled with HKY85 (25) and binding-site nucleotides are modeled with HB98 (26). To make use of these models, Phyloscan needs a phylogenetic tree relating the species from which the sequences derive. The user should attempt to find an applicable tree in the literature. Alternatively, the user can make an educated guess; Phyloscan will perform well enough if there has been a good-faith effort to give a reasonable tree topology and set of edge lengths.

The phylogenetic tree should be supplied in Newick tree format (also termed New Hampshire tree format); a description for that is available on the Phyloscan help page. The length of each phylogenetic tree edge should be supplied as a non-negative number; it is the average number of substitution events, per nucleotide position, that are expected in neutrally evolving (junk) DNA. For instance, a value of 0.1 for a phylogenetic tree edge means that, within a span of 500 nt positions, we expect an average of 50 nt substitution events to occur, in the time interval separating the ancestral and descendant sequences that are connected by that edge.
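To make the edge-length convention concrete, the sketch below pulls the edge lengths out of a Newick string and reproduces the arithmetic of the example above (0.1 substitutions per position over 500 positions). The tree, its edge lengths and the helper names are hypothetical, and the regular-expression scan is a simplification rather than a full Newick parser; it says nothing about how Phyloscan itself reads the tree.

```python
import re


def newick_edge_lengths(newick):
    """Return (label, length) pairs for every edge that carries a length.

    Unnamed internal edges get an empty label. This is a plain
    regular-expression scan, not a full Newick parser.
    """
    pairs = re.findall(r"([^(),:;\s]*):\s*([0-9.eE+-]+)", newick)
    edges = [(label, float(length)) for label, length in pairs]
    for label, length in edges:
        if length < 0:
            raise ValueError(f"negative edge length for {label or 'internal edge'}")
    return edges


def expected_substitutions(edge_length, num_positions):
    """Expected substitution events along one edge across num_positions neutral sites."""
    return edge_length * num_positions


# Hypothetical tree; the edge lengths are made up for illustration.
example_tree = "((human:0.01,chimp:0.01):0.02,(mouse:0.08,rat:0.09):0.25);"
for label, length in newick_edge_lengths(example_tree):
    print(label or "(internal)", length)

# The worked example from the text: an edge of length 0.1 over 500 positions.
print(expected_substitutions(0.1, 500))
```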

Sequences to be scanned

The user selects a file format, and supplies gene promoter (or other) sequence data to be scanned, by pasting them into a text box, or by uploading a file. Each sequence is labeled by the species from which it comes and by the gene (i.e. orthologous gene group) with which it is associated. Sequences can be supplied as aligned or unaligned, and the choice need not be consistent from gene to gene. For instance, suppose that human, chimp and baboon promoter sequences for gene ‘abc’ are aligned, and the orthologous sequences for mouse and rat are also aligned; when the data for gene ‘bcd’ is supplied, the promoter sequences from the same species can be grouped differently for alignments, and any of the sequences can be left unaligned to the others. Each supplied sequence should appear exactly once in the input data.

The supplied identifier for a sequence must conform to a specific format. The text before the first ‘.’ must match the name of a species present in the phylogenetic tree. The text after the last ‘.’ names the gene family and must be the same for all sequences that are orthologous to one another, whether or not they are aligned; for example, the sequence upstream of the human ‘abc’ gene and its orthologous counterparts should be labeled with a shared identifier, such as ‘abc’. If an identifier has more than one ‘.’, then the text between the first and last ‘.’ is ignored by Phyloscan. The letters in the nucleotide sequences can be any combination of uppercase and lowercase; Phyloscan ignores the case distinction.
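The identifier rule can be illustrated with a short sketch that splits an identifier into its species and gene parts and groups identifiers by gene. The function names and the example identifiers are hypothetical; the only behavior taken from the text is that the species precedes the first ‘.’, the gene label follows the last ‘.’, and anything in between is ignored.

```python
from collections import defaultdict


def parse_sequence_identifier(identifier):
    """Split 'species[.ignored...].gene' into (species, gene)."""
    if "." not in identifier:
        raise ValueError(f"identifier {identifier!r} must contain at least one '.'")
    species = identifier.split(".", 1)[0]       # text before the first '.'
    gene = identifier.rsplit(".", 1)[1]         # text after the last '.'
    if not species or not gene:
        raise ValueError(f"identifier {identifier!r} has an empty species or gene part")
    return species, gene


def group_by_gene(identifiers):
    """Map each gene label to the list of species that supply a sequence for it."""
    groups = defaultdict(list)
    for identifier in identifiers:
        species, gene = parse_sequence_identifier(identifier)
        groups[gene].append(species)
    return dict(groups)


# Hypothetical identifiers; the middle field is ignored.
print(group_by_gene(["human.abc", "chimp.upstream.abc", "mouse.v2.draft.bcd"]))
```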

Motif model

The user supplies instances of known binding sites as input to Phyloscan, so that Phyloscan can build a motif model for subsequent scanning. These instances are supplied in a user-specified format; they are pasted into the form or uploaded as a file.

From these data, Phyloscan constructs a product phylogeny model (27), also known as a phylogenetic motif model (28). Phyloscan employs the nucleotide substitution models of HKY85 (25) and HB98 (26) for neutral- and functional-position evolution, respectively.

All supplied binding sites should be unaligned, gapless, and of the same length. Known binding sites can be found in public databases such as JASPAR (29), PAZAR (30), PRODORIC (31), RegTransBase (32) and TRANSFAC (33).
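A user preparing known binding sites might pre-check them along the following lines. This is a minimal sketch under stated assumptions: the function names and example sites are made up, folding case and rejecting any character other than A, C, G and T are choices of this sketch that are stricter than the text requires, and the per-position counts are only a conventional summary of the sites, not the product phylogeny model that Phyloscan builds.

```python
from collections import Counter


def validate_binding_sites(sites):
    """Upper-case the sites and check that they are gapless and of equal length."""
    if not sites:
        raise ValueError("at least one binding site is required")
    normalized = [site.upper() for site in sites]
    width = len(normalized[0])
    for site in normalized:
        if len(site) != width:
            raise ValueError(f"site {site!r} does not have length {width}")
        if set(site) - set("ACGT"):
            raise ValueError(f"site {site!r} contains gaps or non-ACGT characters")
    return normalized


def position_counts(sites):
    """Count A, C, G and T at each position of the validated sites."""
    normalized = validate_binding_sites(sites)
    counts = []
    for column in zip(*normalized):
        tally = Counter(column)
        counts.append({base: tally.get(base, 0) for base in "ACGT"})
    return counts


# Hypothetical sites, for illustration only.
example_sites = ["ACGTAC", "acgttc", "ACCTAC"]
for position, column_counts in enumerate(position_counts(example_sites), start=1):
    print(position, column_counts)
```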

Palindrome flag

The user specifies whether Phyloscan should assume that the supplied known binding sites are palindromes: when a nucleotide sequence (read from 5′ to 3′) is identical to the Watson–Crick complementary sequence to which it would bind in a DNA double helix (also read from 5′ to 3′), the sequence is said to be palindromic.

Many transcription factors are dimeric and recognize a motif that is palindromic; Phyloscan can exploit this common occurrence. Among other features, a check in the palindrome form box permits Phyloscan to skip the reverse scan of each supplied sequence, leading to better statistical significance for the binding sites that are located.

When the user indicates a palindromic model, each binding site supplied as part of the motif model can be supplied in either orientation, but not in both orientations. When the user indicates a non-palindromic model, all of the binding sites supplied for the motif model must have the same orientation, from the perspective of the binding protein or RNA molecule.
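The definition of a palindromic sequence given above can be checked directly. The sketch below computes the reverse complement and compares it with the original; the function names are illustrative. Note that real binding-site collections are usually only approximately palindromic, so an exact test like this one is a guide for a single sequence, not a substitute for the user's judgment about the motif.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Return the Watson-Crick complementary strand, read 5' to 3'."""
    return sequence.translate(_COMPLEMENT)[::-1]


def is_palindromic(sequence):
    """True when the sequence equals its own reverse complement (case-insensitive)."""
    return sequence.upper() == reverse_complement(sequence).upper()


print(reverse_complement("AACG"))
print(is_palindromic("GAATTC"), is_palindromic("GATTACA"))
```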

Fragmentation mask

Many transcription factors are relatively insensitive to the identity of the nucleotide at some positions within a binding site. For instance, a dimeric transcription factor may bind regardless of the handful of nucleotides that fall between the reverse complement ‘half-sites’ to which each constituent monomer binds. The user specifies, with an asterisk, which positions are important for binding specificity and, with a period, which positions are ignorable. When in doubt, the user should supply an asterisk for a position.

For example, if the middle six positions of a 22-nt wide binding site are not significant for binding, the supplied fragmentation mask should be

********......********
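A mask of this kind is easy to generate programmatically. In the sketch below, the helper name and the use of 0-based position indices are assumptions made for illustration; only the asterisk/period convention comes from the text.

```python
def make_fragmentation_mask(width, ignorable_positions):
    """Build a mask with '*' for important positions and '.' for ignorable ones.

    ignorable_positions holds 0-based indices (an assumption of this sketch).
    """
    ignorable = set(ignorable_positions)
    if any(index < 0 or index >= width for index in ignorable):
        raise ValueError("ignorable position outside the binding site")
    return "".join("." if index in ignorable else "*" for index in range(width))


# The example from the text: the middle six positions of a 22-nt site.
print(make_fragmentation_mask(22, range(8, 14)))
```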

p-value cutoff

Phyloscan will report a promoter region as being likely to contain one or more binding sites if and only if there is sufficient evidence of the binding sites (i) in the primary species, as considered in isolation and (ii) in the primary species as considered in the context of the remaining orthologous sequences (see below for an explanation of the term primary species). The p-value cutoff field sets the cutoff threshold for the primary species considered in isolation; for instance, a cutoff value of 0.05 will instruct Phyloscan to consider only those promoter regions with a p-value of 0.05 or better in the primary species. With this cutoff, approximately 1 of 20 promoter regions that do not contain binding sites will be false positives at this stage, and Phyloscan will proceed with the analysis of the promoter region in the context of the promoter region's orthologous sequences. (Such a high interim level of false positives is acceptable because of the further processing that occurs; see q-value cutoff below.)

The setting of a low (tight) value for the p-value cutoff, e.g. 0.001, will cause Phyloscan to reject promoter regions that do not appear quite good in the primary species, even if they could otherwise be ‘rescued’ by the existence of high-quality binding sites in the orthologous sequences that are not aligned to the primary species' sequence. Note that a promoter region that passes such a strict cutoff is necessarily of high quality, and frequently such high quality will cause the region to pass the subsequent q-value test as well, unless the second test is even more strict. On the other hand, a high (lax) value for the p-value cutoff will instruct Phyloscan to not be too concerned with the quality of the binding sites in the primary species; Phyloscan will deem a promoter region to be of high quality if consideration of the primary species and orthologous sequences together so indicates. The default value, 0.05, has been chosen so that Phyloscan will identify (i) those promoter regions that have one or more high-quality binding sites in the primary species and (ii) those promoter regions that have only low-quality binding sites in the primary species but for which the conservation of those sites across the remaining species is significant evidence of the functionality of those sites. However, binding sites that are absent in a promoter region in the primary species, but present in the orthologous sequences, are unlikely to be detected when the cutoff is 0.05 (or lower).
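The two-part reporting rule described in this section can be summarized in a few lines. The sketch below is a simplification under stated assumptions: the function and parameter names are hypothetical, the comparisons are taken to be inclusive, and the statistics are treated as already computed. The default cutoffs of 0.05 and 0.001 are the defaults quoted in this paper.

```python
def is_reported(primary_p_value, combined_q_value, p_cutoff=0.05, q_cutoff=0.001):
    """Report a promoter region only if it passes both stages.

    Stage 1: the primary species, considered in isolation, meets the p-value cutoff.
    Stage 2: the primary species together with its orthologs meets the q-value cutoff.
    """
    passes_primary = primary_p_value <= p_cutoff
    passes_combined = combined_q_value <= q_cutoff
    return passes_primary and passes_combined


def expected_interim_false_positives(num_nonbinding_promoters, p_cutoff=0.05):
    """Expected count of non-binding promoter regions that pass stage 1 by chance."""
    return p_cutoff * num_nonbinding_promoters


print(is_reported(0.01, 0.0005))   # strong in the primary species and overall
print(is_reported(0.20, 1e-10))    # rejected at stage 1 despite strong orthologs
print(expected_interim_false_positives(1000))
```

The second call shows the trade-off discussed above: a region that is weak in the primary species is dropped before the orthologous evidence is ever consulted.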

q-value cutoff

The q-value cutoff is the mechanism by which Phyloscan balances the trade-off between the number and quality of the promoter regions that it identifies. The q-value (also termed the false discovery rate) is the expected ratio of the number of false discoveries in an output data set to the size of the output data set. For example, for a set of 40 promoter regions reported as significant hits by Phyloscan, a q-value of 0.05 would indicate that, on average, 2 of those 40 will be false discoveries (under the assumption that the statistical models that are employed perfectly model the underlying biology). This cutoff defaults to 0.001, a conservative value, to account for the fact that the actual biology is more complicated than are the statistical models that we use to analyze it.

Note that q-value differs from p-value. Each is a fraction with the numerator equal to the number of false positives in an output set. However, for p-value the denominator is the expected number of negative cases (i.e. the number of promoters to which the regulatory molecule does not bind); for q-value the denominator is the size of the output set.
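The difference between the two denominators can be seen with a small numerical sketch. The counts used here (4000 non-binding promoters, 40 reported regions, 2 false positives) are hypothetical, as are the function names; the code only restates the two ratios defined above and the 40 × 0.05 = 2 example.

```python
def expected_false_discoveries(q_value, output_size):
    """Expected number of false discoveries in an output set of the given size."""
    return q_value * output_size


def p_and_q_style_fractions(false_positives, num_negative_cases, output_size):
    """Return (fraction of negatives reported, fraction of the output that is false)."""
    p_style = false_positives / num_negative_cases
    q_style = false_positives / output_size
    return p_style, q_style


# Example from the text: 40 reported regions at a q-value of 0.05.
print(expected_false_discoveries(0.05, 40))

# Hypothetical counts: the same 2 false positives give very different fractions.
print(p_and_q_style_fractions(false_positives=2, num_negative_cases=4000, output_size=40))
```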

Rank weights

Much of the strength of Phyloscan arises from its ability to combine the evidence across multiple binding sites within a promoter region. The default weight, 0.9, for the best site indicates to Phyloscan that ~90% of the time, a promoter region with one or more functional binding sites will have at least one strong binding site. The default rank weight, 0.1, for the second-best site indicates to Phyloscan that ~10% of the time, the best site will not be strong, yet the second-best site will be strong enough that the best two sites taken together cause the promoter region to be identified as functional for the transcription factor.

The user must supply one or more rank weights. Each supplied rank weight must be non-negative, and at least one of the rank weights must be positive. If the supplied rank weights do not sum to 1.0, they will be scaled proportionally.
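The validation and proportional scaling of rank weights described above can be sketched as follows. The function name is hypothetical, and the sketch covers only the stated input rules (at least one weight, none negative, at least one positive, rescaled to sum to 1.0), not how the weights enter the statistical combination.

```python
def normalize_rank_weights(weights):
    """Validate rank weights and scale them proportionally so that they sum to 1.0."""
    if not weights:
        raise ValueError("at least one rank weight is required")
    if any(weight < 0 for weight in weights):
        raise ValueError("rank weights must be non-negative")
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one rank weight must be positive")
    return [weight / total for weight in weights]


print(normalize_rank_weights([0.9, 0.1]))  # the defaults already sum to 1.0
print(normalize_rank_weights([3, 1]))      # scaled proportionally
```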

Primary species

Once Phyloscan has accepted the above inputs and has checked that they are reasonable, it will ask the user to select a primary species. This selection influences the algorithm as discussed earlier, in the ‘p-value cutoff’ section.

Acknowledgment boxes

As part of its evaluation of the user-supplied inputs, Phyloscan checks whether any species present in the phylogenetic tree fails to be present in the sequence data and, conversely, whether any species present in the sequence data fails to be present in the phylogenetic tree. If the former event arises, the user is asked to acknowledge that the extra species in the phylogenetic tree will be ignored. If the latter event occurs, the user is asked to acknowledge that the supplied sequences for the extra species will be ignored.
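The consistency check between the tree and the sequence data amounts to two set differences. In this minimal sketch the function name and the species lists are hypothetical; it shows what a user could run on their own inputs before submission.

```python
def species_mismatches(tree_species, sequence_species):
    """Return (species only in the tree, species only in the sequence data)."""
    tree_set = set(tree_species)
    sequence_set = set(sequence_species)
    only_in_tree = sorted(tree_set - sequence_set)
    only_in_sequences = sorted(sequence_set - tree_set)
    return only_in_tree, only_in_sequences


# Hypothetical inputs: 'rat' has no sequences, 'baboon' is missing from the tree.
only_in_tree, only_in_sequences = species_mismatches(
    ["human", "chimp", "mouse", "rat"],
    ["human", "chimp", "mouse", "baboon"],
)
print("tree species that would be ignored:", only_in_tree)
print("species whose sequences would be ignored:", only_in_sequences)
```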

THE OUTPUTS

Figure 2 shows the best result calculated from the example data that is provided by the web site. Here, we describe the fields present in the output.

Figure 2.
A run with the example data set provided by our web server, for identifying Escherichia coli binding sites for Crp, gives the ‘mtlA’ gene family as the best result. The combined q-value for this gene family, 3.544 × 10^−16 ...

Gene family

Gene family is the name associated with a gene and its orthologs (if any). It is extracted from the sequences-to-be-scanned input data and is the text following the last ‘.’ in a sequence identifier.

Combined q-value

The combined q-value is the proportion of groups of orthologous promoter regions in the Phyloscan output of this quality or better that is expected to be false discoveries. For instance, if q = 0.05 for the 40th-best reported promoter region, that result indicates that, on average, 2 among the 40 are false discoveries.

Combined q-value is a measure of a promoter region and its orthologous sequences, whether aligned to it or not, when the evidence for all of the sequences and for all of the potential binding sites is considered together. This statistic reflects multiple-testing considerations. Because the statistical model only approximately models the underlying biology, we find a value ≤ 0.001 to be statistically significant in many circumstances.

Combined p-value

The combined p-value is the probability that a randomly generated promoter region will accidentally look this good. This statistic does not reflect multiple-testing considerations, in that its computation ignores the number of promoter regions that were scanned. Similar to the combined q-value, the combined p-value is a measure of a promoter region and its orthologous sequences, whether aligned to it or not, when the evidence for all of the sequences and for all of the potential binding sites is considered together.

Species name

A species name must be associated with each sequence. It will be extracted from the sequences-to-be-scanned input data, as the text preceding the first ‘.’ in the sequence identifier. It is also present in the user-supplied phylogenetic tree.

If, for a gene promoter region, a species' sequence is aligned with one or more orthologous sequences, they will be presented together in a block. The promoter p-value (described below) and the binding sites' E-values (that are also described below) shown with the first species in the block are statistics applicable to the alignment block.

Promoter p-value

Promoter p-value is a measure of a single alignment block of a promoter region, when the evidence of all the sequences within the block and all the potential binding sites within the block are considered together. Promoter p-value is the probability that a randomly generated alignment block will accidentally look this good. For alignment blocks that contain sequence from the primary species, the promoter p-value will be lower than the user-specified p-value cutoff.

Site rank

The site rank is the relative strength of a potential binding site found in the sequence data. A value of ‘1’ indicates that it is the strongest site found in a species' sequence data for a promoter region, a value of ‘2’ indicates that it is the second strongest site, and so on.

The number of sites listed will depend upon the user-provided input rank weights and the strengths of the sites. In addition to an evaluation of its strength, via the rank weights each site is evaluated as to how surprising it is to find a site of this strength at this rank. For example, there are instances for which the discovery that the strongest site has an E-value of 0.10 is not unusual, but for which the discovery that the second strongest site has a weaker E-value of 0.15 is unusual. All sites that are as strong as or stronger than the most unusual site are listed.

Site sequence

Phyloscan reports the sequence of nucleotides in each potential binding site. Note that these are shown in the forward orientation, even when the site better matches the pattern when read in the reverse-complement sequence.

Sequence orientation

The sequence orientation is set to ‘F’ when the forward orientation of the potential binding site matches the pattern. It is set to ‘R’ when the reverse-complement sequence is the match to the pattern. When the pattern is palindromic, an ‘F’ will always be indicated.

Binding site E-value

The binding site E-value is similar to the promoter p-value, although it does not combine evidence across multiple potential binding sites. The E-value for a single potential binding site is the average number of sites in a randomly generated alignment block of this size that are expected to accidentally look this good.

Position in promoter

The location of each potential binding site in the input sequence data is reported. The first position in any input sequence is numbered ‘1’ (rather than ‘0’, as some computer scientists prefer). Gaps are not counted.
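The numbering convention (positions start at 1 and alignment gaps are skipped) can be illustrated with a small conversion from an alignment column to a reported position. The function name, the 0-based column index and the use of ‘-’ as the gap character are assumptions of this sketch.

```python
def ungapped_position(aligned_sequence, column_index, gap_chars="-"):
    """Convert a 0-based alignment column to a 1-based position that skips gaps."""
    if aligned_sequence[column_index] in gap_chars:
        raise ValueError("the requested column holds a gap, not a nucleotide")
    gaps_before = sum(1 for char in aligned_sequence[:column_index] if char in gap_chars)
    return column_index - gaps_before + 1


# In 'AC--GT', the 'G' sits in alignment column 4 but is the third nucleotide.
print(ungapped_position("AC--GT", 4))
```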

Clicking on the number will take the user to a web page that shows the location(s) of the potential binding site(s) graphically.

CONCLUSION

The ability to scan DNA sequence for regulatory binding sites is key to an understanding of gene regulation and its effects on normal and pathological cell function and differentiation. For the first time, our new web server brings together the use of the Bailey–Gribskov technique, for combining mixed aligned and unaligned sequence data, and the Neuwald–Green technique, for statistically combining multiple binding sites' data, into a scan engine that runs on a user's multi-genomic data sets.

FUNDING

National Science Foundation (CCF0914739); Department of Energy (DE-SC000592 to L.A.N.); National Institutes of Health (5K25HG003291 to L.A.N.); Wadsworth Center Bioinformatics Core. Funding for open access charge: National Science Foundation (CCF0914739) and Department of Energy (DE-SC000592).

Conflict of interest statement. None declared.

REFERENCES

1. Hertz GZ, Hartzell GW 3rd, Stormo GD. Identification of consensus patterns in unaligned DNA sequences known to be functionally related. Comput. Appl. Biosci. 1990;6:81–92.
2. Quandt K, Frech K, Karas H, Wingender E, Werner T. MatInd and MatInspector: new fast and versatile tools for detection of consensus matches in nucleotide sequence data. Nucleic Acids Res. 1995;23:4878–4884.
3. Chen QK, Hertz GZ, Stormo GD. MATRIX SEARCH 1.0: a computer program that scans DNA sequences for transcriptional elements using a database of weight matrices. Comput. Appl. Biosci. 1995;11:563–566.
4. Prestridge DS. SIGNAL SCAN 4.0: additional databases and sequence formats. Comput. Appl. Biosci. 1996;12:157–160.
5. Berman BP, Nibu Y, Pfeiffer BD, Tomancak P, Celniker SE, Levine M, Rubin GM, Eisen MB. Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila genome. Proc. Natl. Acad. Sci. USA. 2002;99:757–762.
6. Kim JT, Gewehr JE, Martinetz T. Binding matrix: a novel approach for binding site recognition. J. Bioinform. Comput. Biol. 2004;2:289–307.
7. Loots GG, Ovcharenko I. rVISTA 2.0: evolutionary analysis of transcription factor binding sites. Nucleic Acids Res. 2004;32:W217–W221.
8. Yellaboina S, Seshadri J, Kumar MS, Ranjan A. PredictRegulon: a web server for the prediction of the regulatory protein binding sites and operons in prokaryote genomes. Nucleic Acids Res. 2004;32:W318–W320.
9. Osada R, Zaslavsky E, Singh M. Comparative analysis of methods for representing and searching for transcription factor binding sites. Bioinformatics. 2004;20:3516–3525.
10. Wasserman WW, Sandelin A. Applied bioinformatics for the identification of regulatory elements. Nat. Rev. Genet. 2004;5:276–287.
11. Münch R, Hiller K, Grote A, Scheer M, Klein J, Schobert M, Jahn D. Virtual Footprint and PRODORIC: an integrative framework for regulon prediction in prokaryotes. Bioinformatics. 2005;21:4187–4189.
12. Su G, Mao B, Wang J. A web server for transcription factor binding site prediction. Bioinformation. 2006;1:156–157.
13. Hiard S, Marée R, Colson S, Hoskisson PA, Titgemeyer F, van Wezel GP, Joris B, Wehenkel L, Rigali S. PREDetector: a new tool to identify regulatory elements in bacterial genomes. Biochem. Biophys. Res. Commun. 2007;357:861–864.
14. Narlikar L, Gordân R, Hartemink AJ. A nucleosome-guided map of transcription factor binding sites in yeast. PLoS Comput. Biol. 2007;3:e215.
15. Whitington T, Perkins AC, Bailey TL. High-throughput chromatin information enables accurate tissue-specific prediction of transcription factor binding sites. Nucleic Acids Res. 2009;37:14–25.
16. Zambelli F, Pesole G, Pavesi G. Pscan: finding over-represented transcription factor binding site motifs in sequences from co-regulated or co-expressed genes. Nucleic Acids Res. 2009;37:W247–W252.
17. Moses AM, Chiang DY, Pollard DA, Iyer VN, Eisen MB. MONKEY: identifying conserved transcription-factor binding sites in multiple alignments using a binding site-specific evolutionary model. Genome Biol. 2004;5:R98.
18. Moses AM, Pollard DA, Nix DA, Iyer VN, Li X-Y, Biggin MD, Eisen MB. Large-scale turnover of functional transcription factor binding sites in Drosophila. PLoS Comput. Biol. 2006;2:e130.
19. GuhaThakurta D. Computational identification of transcriptional regulatory elements in DNA sequence. Nucleic Acids Res. 2006;34:3585–3598.
20. Bailey TL, Boden M, Buske FA, Frith M, Grant CE, Clementi L, Ren J, Li WW, Noble WS. MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res. 2009;37:W202–W208.
21. Carmack CS, McCue LA, Newberg LA, Lawrence CE. PhyloScan: identification of transcription factor binding sites using cross-species evidence. Algorithms Mol. Biol. 2007;2:1.
22. Gertz J, Siggia ED, Cohen BA. Analysis of combinatorial cis-regulation in synthetic and genomic promoters. Nature. 2009;457:215–218.
23. Neuwald AF, Green P. Detecting patterns in protein sequences. J. Mol. Biol. 1994;239:698–712.
24. Bailey TL, Gribskov M. Methods and statistics for combining motif match scores. J. Comput. Biol. 1998;5:211–221.
25. Hasegawa M, Kishino H, Yano T. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 1985;22:160–174.
26. Halpern AL, Bruno WJ. Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies. Mol. Biol. Evol. 1998;15:910–917.
27. Newberg LA, Thompson WA, Conlan S, Smith TM, McCue LA, Lawrence CE. A phylogenetic Gibbs sampler that yields centroid solutions for cis regulatory site prediction. Bioinformatics. 2007;23:1718–1727.
28. Hawkins J, Grant C, Noble WS, Bailey TL. Assessing phylogenetic motif models for predicting transcription factor binding sites. Bioinformatics. 2009;25:i339–i347.
29. Portales-Casamar E, Thongjuea S, Kwon AT, Arenillas D, Zhao X, Valen E, Yusuf D, Lenhard B, Wasserman WW, Sandelin A. JASPAR 2010: the greatly expanded open-access database of transcription factor binding profiles. Nucleic Acids Res. 2010;38:D105–D110.
30. Portales-Casamar E, Arenillas D, Lim J, Swanson MI, Jiang S, McCallum A, Kirov S, Wasserman WW. The PAZAR database of gene regulatory information coupled to the ORCA toolkit for the study of regulatory sequences. Nucleic Acids Res. 2009;37:D54–D60.
31. Münch R, Hiller K, Barg H, Heldt D, Linz S, Wingender E, Jahn D. PRODORIC: prokaryotic database of gene regulation. Nucleic Acids Res. 2003;31:266–269.
32. Kazakov AE, Cipriano MJ, Novichkov PS, Minovitsky S, Vinogradov DV, Arkin A, Mironov AA, Gelfand MS, Dubchak I. RegTransBase—a database of regulatory sequences and interactions in a wide range of prokaryotic genomes. Nucleic Acids Res. 2007;35:D407–D412.
33. Matys V, Fricke E, Geffers R, Gößling E, Haubrock M, Hehl R, Hornischer K, Karas D, Kel AE, Kel-Margoulis OV. TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Res. 2003;31:374–378.

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press