For input, Phyloscan requests the information itemized below. Defaults and/or examples are available for each item.
The user can optionally supply an e-mail address. If it is supplied, the user will receive notification when the submitted Phyloscan job has completed. Whether or not an e-mail address is supplied, upon job submission the user will be provided a link to where the results will become available. The user can go to that web page immediately; the page refreshes every 10 s until the results become available.
Phyloscan exploits phylogenetic relationships among sequences that are (multiply) aligned, by employing nucleotide substitution models: non-functional nucleotides are modeled with HKY85 (25) and binding-site nucleotides are modeled with HB98 (26). To make use of these models, Phyloscan needs a phylogenetic tree relating the species from which the sequences derive. The user should attempt to find an applicable tree in the literature. Alternatively, the user can make an educated guess; Phyloscan will perform well enough if there has been a good-faith effort to give a reasonable tree topology and set of edge lengths.
The phylogenetic tree should be supplied in Newick tree format (also termed New Hampshire tree format); a description for that is available on the Phyloscan help page. The length of each phylogenetic tree edge should be supplied as a non-negative number; it is the average number of substitution events, per nucleotide position, that are expected in neutrally evolving (junk) DNA. For instance, a value of 0.1 for a phylogenetic tree edge means that, within a span of 500 nt positions, we expect an average of 50 nt substitution events to occur, in the time interval separating the ancestral and descendant sequences that are connected by that edge.
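The edge-length arithmetic above can be checked with a short sketch. The tree string, species names, and edge lengths are made up for illustration, and the helper functions are hypothetical; they are not part of Phyloscan.

```python
import re

def expected_substitutions(edge_length: float, num_positions: int) -> float:
    """Expected number of neutral substitution events along one edge,
    summed over num_positions nucleotide positions."""
    if edge_length < 0:
        raise ValueError("edge lengths must be non-negative")
    return edge_length * num_positions

def edge_lengths(newick: str) -> list[float]:
    """Collect the numbers that follow ':' in a Newick string and check
    that none of them is negative."""
    pattern = r":\s*([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)"
    lengths = [float(x) for x in re.findall(pattern, newick)]
    if any(length < 0 for length in lengths):
        raise ValueError("edge lengths must be non-negative")
    return lengths

# Hypothetical tree: names and lengths are illustrative only.
tree = "((human:0.01,chimp:0.01):0.05,(mouse:0.08,rat:0.09):0.25);"
print(edge_lengths(tree))
print(expected_substitutions(0.1, 500))  # the 500-nt example from the text
```

This only extracts and checks the edge lengths; it does not validate the full Newick grammar.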
Sequences to be scanned
The user selects a file format, and supplies gene promoter (or other) sequence data to be scanned, by pasting them into a text box, or by uploading a file. Each sequence is labeled by the species from which it comes and by the gene (i.e. orthologous gene group) with which it is associated. Sequences can be supplied as aligned or unaligned, and the choice need not be consistent from gene to gene. For instance, suppose that human, chimp and baboon promoter sequences for gene ‘abc’ are aligned, and the orthologous sequences for mouse and rat are also aligned; when the data for gene ‘bcd’ is supplied, the promoter sequences from the same species can be grouped differently for alignments, and any of the sequences can be left unaligned to the others. Each supplied sequence should appear exactly once in the input data.
The supplied identifier for a sequence must conform to a specific format. The text before the first ‘.’ must match the name of a species present in the phylogenetic tree. The text after the last ‘.’ must match those sequences that are orthologous to the sequence, whether or not aligned; for example, the sequence upstream of the human ‘abc’ gene and its orthologous counterparts should be labeled with a shared identifier, such as ‘abc.’ If an identifier has more than one ‘.’, then the text between the first and last ‘.’ is ignored by Phyloscan. The letters in the nucleotide sequences can be any combination of uppercase and lowercase; Phyloscan ignores the case distinction.
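The identifier rule can be sketched as follows. The function name and the example identifiers are hypothetical; the sketch only mirrors the rule as stated above (species before the first ‘.’, orthologous-group label after the last ‘.’, anything in between ignored).

```python
def parse_sequence_id(identifier: str) -> tuple[str, str]:
    """Split an identifier into (species, gene_group).

    Text before the first '.' is the species; text after the last '.'
    is the orthologous-group label; anything in between is dropped.
    """
    first = identifier.find(".")
    last = identifier.rfind(".")
    if first == -1:
        raise ValueError(f"identifier has no '.': {identifier!r}")
    species = identifier[:first]
    gene_group = identifier[last + 1:]
    if not species or not gene_group:
        raise ValueError(f"empty species or gene label: {identifier!r}")
    return species, gene_group

# Illustrative identifiers only.
print(parse_sequence_id("human.abc"))
print(parse_sequence_id("mouse.chr7_upstream.abc"))
```

Both example identifiers share the label "abc", so under this rule they would be treated as orthologous.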
The user supplies instances of known binding sites as input to Phyloscan, so that Phyloscan can build a motif model for subsequent scanning. These instances are supplied in a user-specified format; they are pasted into the form or uploaded as a file.
From these data, Phyloscan constructs a product phylogeny model (27), also known as a phylogenetic motif model (28). Phyloscan employs the nucleotide substitution models of HKY85 (25) and HB98 (26) for neutral- and functional-position evolution, respectively.
All supplied binding sites should be unaligned, gapless, and of the same length. Known binding sites can be found in public databases such as JASPAR (29), PAZAR (30), PRODORIC (31), RegTransBase (32) and TRANSFAC (33).
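Before submitting known binding sites, a user might pre-check the gapless, equal-length requirement with a sketch like the one below. The function is hypothetical, and restricting the alphabet to A, C, G and T (after uppercasing) is an assumption of this sketch rather than a documented Phyloscan rule.

```python
def check_binding_sites(sites: list[str]) -> int:
    """Return the common site length, or raise ValueError if the sites
    are not gapless and of equal length.

    Assumption: only A, C, G, T are allowed; case is ignored.
    """
    if not sites:
        raise ValueError("at least one binding site is required")
    width = len(sites[0])
    for site in sites:
        if len(site) != width:
            raise ValueError(f"site {site!r} has length {len(site)}, expected {width}")
        if set(site.upper()) - set("ACGT"):
            raise ValueError(f"site {site!r} contains gaps or non-ACGT characters")
    return width

# Made-up sites for illustration.
print(check_binding_sites(["TGTGATCTAGATCACA", "tgtgagttagctcaca"]))
```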
The user specifies whether Phyloscan should assume that the supplied known binding sites are palindromes: when a nucleotide sequence (read from 5′ to 3′) is identical to the Watson–Crick complementary sequence to which it would bind in a DNA double helix (also read from 5′ to 3′), the sequence is said to be palindromic.
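The definition of a palindromic sequence can be expressed as a reverse-complement test. This is a minimal sketch with hypothetical function names; it tests a single exact sequence, whereas a motif may be only approximately palindromic across its known sites.

```python
# Watson-Crick complement table for unambiguous DNA letters.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Complementary strand, read 5' to 3'."""
    return seq.upper().translate(_COMPLEMENT)[::-1]

def is_palindromic(seq: str) -> bool:
    """True when the sequence equals its own reverse complement."""
    return seq.upper() == reverse_complement(seq)

print(is_palindromic("GAATTC"))
print(is_palindromic("GATTACA"))
```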
Many transcription factors are dimeric and recognize a motif that is palindromic; Phyloscan can exploit this common occurrence. Among other features, a check in the palindrome form box permits Phyloscan to skip the reverse scan of each supplied sequence, leading to better statistical significance for the binding sites that are located.
When the user indicates a palindromic model, each binding site supplied as part of the motif model can be supplied in either orientation, but not in both orientations. When the user indicates a non-palindromic model, all of the binding sites supplied for the motif model must have the same orientation, from the perspective of the binding protein or RNA molecule.
Many transcription factors are relatively insensitive to the identity of the nucleotide at some positions within a binding site. For instance, a dimeric transcription factor may bind regardless of the handful of nucleotides that fall between the reverse complement ‘half-sites’ to which each constituent monomer binds. The user specifies, with an asterisk, which positions are important for binding specificity and, with a period, which positions are ignorable. When in doubt, the user should supply an asterisk for a position.
For example, if the middle six positions of a 22-nt wide binding site are not significant for binding, the supplied fragmentation mask should be ‘********......********’ (eight asterisks, six periods, eight asterisks).
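A mask of this kind can also be generated programmatically. The sketch below builds the mask for the 22-nt example, with the middle six positions marked ignorable; the function name and the zero-based position convention are assumptions of the sketch.

```python
from collections.abc import Iterable

def fragmentation_mask(width: int, ignorable: Iterable[int]) -> str:
    """Build a mask string: '*' for positions that matter for binding
    specificity, '.' for ignorable positions (zero-based indices)."""
    skip = set(ignorable)
    if any(i < 0 or i >= width for i in skip):
        raise ValueError("ignorable position outside the binding site")
    return "".join("." if i in skip else "*" for i in range(width))

# 22-nt site; positions 8..13 (zero-based) are the middle six.
mask = fragmentation_mask(22, range(8, 14))
print(mask)
```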
Phyloscan will report a promoter region as being likely to contain one or more binding sites if and only if there is sufficient evidence of the binding sites (i) in the primary species, as considered in isolation and (ii) in the primary species as considered in the context of the remaining orthologous sequences (see below for an explanation of the term primary species). The p-value cutoff field sets the cutoff threshold for the primary species considered in isolation; for instance, a cutoff value of 0.05 will instruct Phyloscan to consider only those promoter regions with a p-value of 0.05 or better in the primary species. With this cutoff, approximately 1 of 20 promoter regions that do not contain binding sites will be false positives at this stage, and Phyloscan will proceed with the analysis of the promoter region in the context of the promoter region's orthologous sequences. (Such a high interim level of false positives is acceptable because of the further processing that occurs; see q-value cutoff below.)
The setting of a low (tight) value for the p-value cutoff, e.g. 0.001, will cause Phyloscan to reject promoter regions that do not appear quite good in the primary species, even if they could otherwise be ‘rescued’ by the existence of high-quality binding sites in the orthologous sequences that are not aligned to the primary species' sequence. Note that a promoter region that passes such a strict cutoff is necessarily of high quality, and frequently such high quality will cause the region to pass the subsequent q-value test as well, unless the second test is even more strict. On the other hand, a high (lax) value for the p-value cutoff will instruct Phyloscan to not be too concerned with the quality of the binding sites in the primary species; Phyloscan will deem a promoter region to be of high quality if consideration of the primary species and orthologous sequences together so indicates. The default value, 0.05, has been chosen so that Phyloscan will identify (i) those promoter regions that have one or more high-quality binding sites in the primary species and (ii) those promoter regions that have only low-quality binding sites in the primary species but for which the conservation of those sites across the remaining species is significant evidence of the functionality of those sites. However, binding sites that are absent in a promoter region in the primary species, but present in the orthologous sequences, are unlikely to be detected when the cutoff is 0.05 (or lower).
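The two-stage decision described above can be summarized in a small sketch. The function and argument names are hypothetical, and the use of "less than or equal" at both stages is an assumption; only the default cutoffs (0.05 and 0.001) come from the text. How Phyloscan computes the two statistics is not shown here.

```python
def passes_cutoffs(p_primary: float, q_combined: float,
                   p_cutoff: float = 0.05, q_cutoff: float = 0.001) -> bool:
    """Report a promoter region only if it passes both stages.

    p_primary:  significance in the primary species taken in isolation.
    q_combined: significance of the primary species taken together with
                the remaining orthologous sequences.
    """
    return p_primary <= p_cutoff and q_combined <= q_cutoff

# Illustrative values only.
print(passes_cutoffs(0.03, 0.0005))  # passes both stages
print(passes_cutoffs(0.20, 0.0001))  # fails the first stage despite strong orthologs
print(passes_cutoffs(0.03, 0.0100))  # passes the first stage, fails the second
```

The second call illustrates why a tight first-stage cutoff prevents weak primary-species regions from being rescued by their orthologs.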
The q-value cutoff is the mechanism by which Phyloscan balances the trade-off between the number and quality of the promoter regions that it identifies. The q-value (also termed the false discovery rate) is the expected ratio of the number of false discoveries in an output data set to the size of the output data set. For example, for a set of 40 promoter regions reported as significant hits by Phyloscan, a q-value of 0.05 would indicate that, on average, 2 of those 40 will be false discoveries (under the assumption that the statistical models that are employed perfectly model the underlying biology). This cutoff defaults to 0.001, a conservative value, to account for the fact that the actual biology is more complicated than are the statistical models that we use to analyze it.
The q-value differs from the p-value. Each is a fraction with the numerator equal to the number of false positives in an output set. However, for the p-value the denominator is the expected number of negative cases (i.e. the number of promoters to which the regulatory molecule does not bind); for the q-value the denominator is the size of the output set.
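The difference in denominators can be made concrete with a short numerical sketch. All counts are invented for illustration, and the function names are hypothetical; the ratios follow the definitions given above.

```python
def p_value_ratio(false_positives: float, negative_cases: float) -> float:
    """False positives relative to the number of negative cases."""
    return false_positives / negative_cases

def q_value_ratio(false_positives: float, output_size: float) -> float:
    """False positives relative to the size of the output set."""
    return false_positives / output_size

def expected_false_discoveries(q_value: float, output_size: int) -> float:
    """Expected number of false discoveries in an output set."""
    return q_value * output_size

# Hypothetical scan: 1000 promoters lack a binding site, 50 of them are
# reported anyway, and the reported set contains 200 promoters in total.
print(p_value_ratio(50, 1000))
print(q_value_ratio(50, 200))
print(expected_false_discoveries(0.05, 40))  # the 40-region example from the text
```

With these made-up counts, the same 50 false positives correspond to a ratio of 0.05 against the negative cases but 0.25 against the output set.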
Much of the strength of Phyloscan arises from its ability to combine the evidence across multiple binding sites within a promoter region. The default weight, 0.9, for the best site indicates to Phyloscan that ~90% of the time, a promoter region with one or more functional binding sites will have at least one strong binding site. The default rank weight, 0.1, for the second-best site indicates to Phyloscan that ~10% of the time, the best site will not be strong, yet the second-best site will be strong enough that the best two sites taken together cause the promoter region to be identified as functional for the transcription factor.
The user must supply one or more rank weights. Each supplied rank weight must be non-negative, and at least one of the rank weights must be positive. If the supplied rank weights do not sum to 1.0, they will be scaled proportionally.
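The constraints on rank weights, and the proportional scaling, can be sketched as follows. The function name is hypothetical; the checks mirror the rules stated above.

```python
def normalize_rank_weights(weights: list[float]) -> list[float]:
    """Validate rank weights and scale them proportionally to sum to 1."""
    if not weights:
        raise ValueError("at least one rank weight is required")
    if any(w < 0 for w in weights):
        raise ValueError("rank weights must be non-negative")
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one rank weight must be positive")
    return [w / total for w in weights]

print(normalize_rank_weights([0.9, 0.1]))  # defaults from the text
print(normalize_rank_weights([9, 1]))      # same proportions, different scale
```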
Once Phyloscan has accepted the above inputs and has checked that they are reasonable, it will ask the user to select a primary species. This selection influences the algorithm as discussed earlier, in the ‘p-value cutoff’ section.
As part of its evaluation of the user-supplied inputs, Phyloscan checks whether any species present in the phylogenetic tree fails to be present in the sequence data and, conversely, whether any species present in the sequence data fails to be present in the phylogenetic tree. If the former event arises, the user is asked to acknowledge that the extra species in the phylogenetic tree will be ignored. If the latter event occurs, the user is asked to acknowledge that the supplied sequences for the extra species will be ignored.
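The consistency check between the tree and the sequence data amounts to two set differences. The sketch below is illustrative only; the function name and species lists are hypothetical.

```python
def species_mismatches(tree_species: list[str],
                       sequence_species: list[str]) -> tuple[list[str], list[str]]:
    """Return (species only in the tree, species only in the sequence data)."""
    tree_set, seq_set = set(tree_species), set(sequence_species)
    return sorted(tree_set - seq_set), sorted(seq_set - tree_set)

# Illustrative input: 'baboon' has no sequences, 'dog' is not in the tree.
tree_only, seq_only = species_mismatches(
    ["human", "chimp", "baboon", "mouse", "rat"],
    ["human", "chimp", "mouse", "rat", "dog"],
)
print(tree_only)  # species in the tree that would be ignored
print(seq_only)   # species whose sequences would be ignored
```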