Sequence alignments of the HMG DNA-binding domains from insect and mammalian group B Sox proteins suggest that the insect proteins may be separated into three distinct groups. The first, containing SoxN, aligns with the vertebrate Sox1, Sox2 and Sox3 proteins and most likely represents an orthologue of the vertebrate group B1 class. This conclusion, based on sequence, is supported by the functional analysis of group B1 proteins in vertebrates and
Drosophila. In both cases, group B1 genes are expressed from the earliest stages of CNS development and are implicated in regulating early neural specification [
21,
22,
38,
39]. In addition, we have evidence that mammalian
Sox1 genes can rescue
SoxN phenotypes in the
Drosophila CNS, supporting the view that these proteins are functionally conserved (P. Overton and S.R., unpublished observations). The group B genes isolated from the basal chordates, acorn worm and sea squirt, have also been shown to be expressed early in the specification of the CNS [
40,
41]. Thus, it appears that all metazoans studied to date have at least one group B gene with expression marking neural lineages early in development. Further studies of primitive invertebrates will determine whether group B
Sox expression is a universal marker for CNS development.
In a previously published phylogenetic study, it was suggested that
Dichaete be classified as a group B2 protein [
3]. However, while that analysis clearly differentiated the group B proteins from other fly Sox proteins, it could not unambiguously resolve the relationships among the group B proteins. In terms of function and expression, the
Dichaete gene behaves very much like a group B1 gene: it is expressed early during CNS development and is required for neural differentiation [
20,
42]. We have previously shown that the mouse
Sox2 gene efficiently rescues
Dichaete phenotypes, further supporting a functional similarity between
Dichaete and vertebrate group B1 genes [
20,
42]. In contrast to the conclusion based on functional studies, the sequence analysis suggests that insect Dichaete DNA-binding domain sequences are markedly different from those of other group B1 proteins and are more similar to those of group B2 proteins. The conservation of the insect sequences indicates that a
Dichaete-like sequence was present at least 300 million years ago, when
Apis and the Diptera last shared a common ancestor [
18]. We believe that the functional evidence is more convincing than the arguments based on sequence alignments and therefore suggest that Dichaete represents a group B1 function that has diverged from the canonical group B1 sequence, presumably due to selection for insect-specific functions. For example, Dichaete is required for early segmentation in the
Drosophila embryo, a highly derived function, and it may be that sequence changes in the HMG domain have been selected for such a function while still allowing a role in CNS specification. As with
Drosophila, both
Anopheles and
Apis are long germ insects that share some aspects of early development such as the early appearance of striped domains of
even skipped expression [
43,
44]. Thus it is possible that insect
Dichaete genes have a common role in early patterning events. It will be of considerable interest to examine the complement of group B
Sox genes in Coleoptera, Homoptera or Orthoptera to see whether the HMG domain sequence and gene organisation are the same as in the insects sequenced so far. To investigate this, we used the Dichaete DNA-binding domain to search the available sequence of the silk moth
Bombyx mori [
45] and found a single group B gene that was clearly an orthologue of the
Dichaete genes discussed here and contained the diagnostic leucine and isoleucine residues.
As with vertebrate group B1 genes,
SoxN and
Dichaete are expressed in broadly overlapping domains and act partially redundantly in CNS specification [
21,
22]. The close similarity between the expression and function of
SoxN and
Dichaete in the CNS raises the possibility that they arose from a common ancestor by a duplication event and may thus share some common regulatory sequences. However, when we compared the sequences 5' or 3' to
SoxN with the
Dichaete 3' sequence, we could not detect any sequence similarity, indicating that any conservation of regulatory sequences is not visible at a large scale. This is not entirely surprising, since we cannot detect any sequence similarity between the
Dichaete regulatory sequences from
Drosophila and
Anopheles, and our analysis indicates that the divergence of
SoxN and
Dichaete predates the
Drosophila-
Anopheles divergence.
Based on the sequence alignment of insect Sox21a DNA-binding domains with those of vertebrate Sox14 proteins, it is possible that Sox21a is an orthologue of the group B2 class. It has been suggested that, in chicken, Sox14 and Sox21 act as antagonists of group B1 function in a subset of the developing CNS [
6]. The function of
Sox21a in
Drosophila is not known at present; however,
Sox21a is expressed late in the development of the embryonic CNS midline, a site of
SoxN and
Dichaete expression, indicating there is the potential for the type of antagonistic interaction proposed for vertebrates. The Sox21b DNA-binding domain sequence indicates that it is closely related to Dichaete. Both these proteins have a set of unique residues in their DNA-binding domains that are not found in any other group B proteins identified to date. The
Sox21b gene is conserved between the insects and its close similarity to
Dichaete suggests that both genes arose from a common origin in an ancestor of the arthropods after its divergence from the nematodes, since there is no closely related sequence in
C. elegans or its relatives. In terms of expression,
Sox21b is expressed in the large hindgut along with Dichaete, supporting the possibility that it may also antagonise the activity of Dichaete. In this respect, then,
Sox21b may represent a group B2 function. It is therefore possible that insects contain two group B1 class activities, involved in early CNS development, and two group B2 class genes. Again, we emphasise that the functional assignment of the insect genes may contrast with the data derived from sequence analysis, which predicts a single group B1 gene and three group B2 genes. We suggest that the separation of group B Sox domains into a B1 class and a B2 class based solely on sequence does not reflect meaningful functional differences in insects. We have initiated a functional analysis of
Sox21a and
Sox21b in the hope that we can clarify this issue.
The genome organisation of the Dichaete cluster is unusual: not only are three genes clustered together in the genome, but two of them, Sox21a and Sox21b, have introns within the HMG domain. The single Sox21a intron is conserved in the Sox21a genes of all four insects, suggesting that it is ancestral to the insects. Sox21b is more complex: there are six introns in D. melanogaster and D. pseudoobscura, four of which are conserved in Anopheles and two in Apis. In the Drosophila species, there are two introns in the DNA-binding domain, the first of which is present in all four insects. The second intron, in an identical location to the Sox21a intron, is found only in the two Drosophila species. A simple model of a single intron loss is therefore unlikely to account for this distribution, since neither Apis nor Anopheles has the intron. It is possible that Apis and Anopheles lost the intron independently or, alternatively, that the common ancestor of the Drosophila species gained the intron, perhaps via a gene conversion event with Sox21a. Interestingly, the two group B genes from C. elegans also contain introns in the DNA-binding domain, in identical positions in both genes, but in different positions from the Sox21a and Sox21b introns. This suggests that the common ancestor of insects and nematodes did not contain DNA-binding domain introns and that these have been acquired independently in the two lineages.
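The two scenarios for the second Sox21b DNA-binding domain intron can be illustrated by counting gain and loss events on a species tree. The following is a minimal sketch, not an analysis performed in this study. It assumes a tree in which Apis is the outgroup to the Diptera and Anopheles is the outgroup to the two Drosophila species, codes the intron as present (1) or absent (0), and weights gains and losses equally. The names `TREE`, `INTRON` and `min_changes` are hypothetical and introduced only for illustration.

```python
# Illustrative sketch only: the topology, the 0/1 coding and the equal
# weighting of gains and losses are assumptions, not results of the study.

# Assumed species tree as nested pairs: Apis is the outgroup to the Diptera.
TREE = ("Apis", ("Anopheles", ("D. melanogaster", "D. pseudoobscura")))

# Presence (1) or absence (0) of the second Sox21b HMG-domain intron.
INTRON = {
    "Apis": 0,
    "Anopheles": 0,
    "D. melanogaster": 1,
    "D. pseudoobscura": 1,
}


def min_changes(tree, states, root_state):
    """Minimum number of gain/loss events given a fixed ancestral state."""

    def cost(node):
        # For each possible state at this node, return the minimum number
        # of changes needed in the subtree below it.
        if isinstance(node, str):  # leaf: only the observed state is allowed
            return {s: 0 if s == states[node] else float("inf") for s in (0, 1)}
        left, right = cost(node[0]), cost(node[1])
        return {
            s: min(left[t] + (s != t) for t in (0, 1))
            + min(right[t] + (s != t) for t in (0, 1))
            for s in (0, 1)
        }

    return cost(tree)[root_state]


if __name__ == "__main__":
    print("events if intron ancestrally absent: ", min_changes(TREE, INTRON, 0))
    print("events if intron ancestrally present:", min_changes(TREE, INTRON, 1))
```

Under these assumptions, an ancestrally absent intron requires one event (a gain in the Drosophila lineage), whereas an ancestrally present intron requires two (independent losses in Apis and Anopheles). Equal weighting of gains and losses is a simplification, so the count alone does not decide between the two scenarios.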
The conservation of genome structure within the insect Dichaete cluster suggests that there may be functional constraints on its organisation. We suggest that this is likely to reflect shared regulatory sequences, since the region between Dichaete and Sox21b in D. melanogaster contains extensive regulatory sequences essential for correct Dichaete expression. We note that both Sox21a and Sox21b have expression domains that overlap with that of Dichaete: the midline for Sox21a and the hindgut for Sox21b. These expression domains may therefore be controlled by common regulatory sequences, and the need to maintain coordinated regulation of the three genes may have maintained the integrity of the cluster in the insects. The conservation of expression between D. melanogaster and D. pseudoobscura is consistent with this view; it will be of interest to examine the expression of all of the Sox genes in Anopheles to explore this hypothesis further.