Genome sequencing has generated large amounts of sequence data, and the computational challenge of identifying genes, promoters, repeats and other sequence patterns has proved demanding [
1-
3]. The average human gene is 28 kb long, with 8.8 exons of ~120 nucleotides (nt) each, separated by 7.8 introns [
4]. Identification of the splicing signals that flank exon boundaries involves finding the correct site amid thousands of incorrect sites. Although computationally distinguishing bulk coding from non-coding sequence has met with relative success, gene prediction programs typically identify exon boundaries with less accuracy [
5]. A number of gene-finding strategies have been developed, including comparison of genomic and expressed sequence tag (EST) sequences [
6], hidden Markov models to identify splicing motifs [
7,
8], Bayesian networks [
9], inter-species genome comparisons to identify genes with homologous structures [
10], and various combinations of these and other strategies [
11-
13]. Additionally, some progress has been made in examining exon splice sites directly in hopes of improving the accuracy of predicting these sites [
14]. Nearly all of these models have been generated using sequences isolated from mammals (primarily human and mouse),
Arabidopsis, or yeast. As the number of sequenced invertebrate genomes increases, the need for splice site identification models designed specifically for those organisms also increases (reviewed in [
15]). Despite the growing number of sequenced invertebrate genomes, few data address how accurately vertebrate-based models identify splice sites in other organisms. The traditional strategies may fall short for two reasons: novel genes have been identified in invertebrates that lack the level of similarity to known sequences that many gene-finding programs require, and many gene prediction programs perform poorly in the GC-poor sequences [
5] that may characterize invertebrate genomes [
16].
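The scale of the search problem noted above, finding the correct splice site amid thousands of incorrect sites, can be illustrated with a minimal sketch. It counts every GT dinucleotide (the minimal consensus at the 5' end of most introns) in a random 28 kb sequence and compares that count with the 7.8 donor sites expected for an average human gene. The function names, the random sequence, and the assumed 50% GC content are illustrative assumptions, not part of the original analysis.

```python
import random


def count_candidate_donors(seq: str) -> int:
    """Count every GT dinucleotide on the given strand (overlaps are impossible for GT)."""
    seq = seq.upper()
    return sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == "GT")


def random_sequence(length: int, gc_fraction: float, rng: random.Random) -> str:
    """Draw an independent, identically distributed sequence (an assumption;
    real genomic sequence is not random)."""
    at = (1.0 - gc_fraction) / 2.0
    gc = gc_fraction / 2.0
    # Weights are given in the order A, C, G, T.
    return "".join(rng.choices("ACGT", weights=[at, gc, gc, at], k=length))


rng = random.Random(0)                      # fixed seed for reproducibility
gene = random_sequence(28_000, 0.5, rng)    # 28 kb: average human gene length from the text
candidates = count_candidate_donors(gene)   # expected to be near 28,000 / 16
true_donors = 7.8                           # one donor site per intron, from the text

print(f"candidate GT sites: {candidates}")
print(f"decoys per true donor site: {candidates / true_donors:.0f}")
```

Under these assumptions the candidate sites outnumber the true donor sites by roughly two orders of magnitude, which is why a bare GT match is not a useful predictor on its own.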
Sea urchins are members of the phylum Echinodermata and provide a useful and interesting phylogenetic perspective as invertebrates at the base of the deuterostome lineage. Although the genome of the purple sea urchin,
Strongylocentrotus purpuratus, was only recently sequenced [
16], the species has long served as a model for embryonic development [
17]. A number of attributes of its genome have been uncovered that set it apart from vertebrate and insect genomes. The GC content of the sea urchin genome is 36.9%, lower than that of most vertebrates [
16]. The
S. purpuratus genome is estimated to contain 23,300 genes, many of which have chordate or protostome homologues, in addition to many novel genes [
16]. The structure of the average sea urchin gene is similar to that of the average human gene: it is 7.7 kb long, with 8.3 exons of 100–115 nt and 7.3 introns of ~750 nt [
16]. Little is known about the evolutionary conservation of the splicing proteins or the splicing signals in sea urchin gene models; however, ESTs matching splicing proteins are upregulated in sea urchin immune cells, or coelomocytes, in response to immune challenge [
18] and a number of gene models encoding splicing proteins have been annotated [
16]. It was therefore of interest to develop a model that could more accurately identify splicing signals within
S. purpuratus genomic sequence.
The sea urchin also has a surprisingly complex immune system [
19] consisting of a rudimentary complement system [
20] and a number of significantly expanded gene families encoding proteins that are homologous to known vertebrate immune proteins [
21]. The
185/333 gene family from the purple sea urchin is an intriguing example of diversification of a family of invertebrate immune response genes [
18,
22-
26]. The
185/333 genes are highly expressed in response to challenge with whole bacteria [
27], lipopolysaccharide (LPS; [
18]), double-stranded RNA (dsRNA), and the fungal signature β-1,3-glucan [
25]. Of the 689 transcripts that have been isolated from 14 animals, 437 have unique sequences [
25,
26]. The diversity takes the form of substantial levels of point mutations, in addition to the presence or absence of numerous short blocks of shared sequence called
elements. The variable presence/absence of these elements defines
element patterns, of which 35 are currently known [
23,
25,
26]. Multiple sequence repeats within the messages and genes allow multiple alignments that are different but equally optimal, of which two have been analyzed extensively [
23].
Although initial reports speculated that the variation in element pattern might be the result of extensive alternative splicing [
18], the first identification of a few
185/333 genes in an early assembly of the sea urchin genome showed that the genes are small and composed of only two short exons [
23,
26]. The first exon encodes the hydrophobic leader sequence, while the second exon encodes the remainder of the open reading frame [
4], including the variable element pattern (see Additional file
1A). Further analysis of the
185/333 gene family from individual sea urchins suggests that it may be composed of 80–120 exceptionally diverse alleles [
22,
23,
26]. Of the 215 genes that have been cloned and sequenced from four different animals, 135 have distinct sequences. The genes are flanked by di- and trinucleotide repeats, and some are linked as closely as ~3 kb [
23]. This close spacing of genes is believed to promote diversification through gene duplication (unpublished data) and frequent recombination [
22]. Analysis of sequences isolated from individual animals, however, indicates that the genes and messages are not identical and suggests possible cytidine deaminase-like post-transcriptional editing of the
185/333 transcripts [
24]. Although the two-exon gene structure initially ruled out alternative splicing [
23,
26], it is possible that the discord between the gene and message sequences, particularly that of non-matching element patterns in the genes vs. messages from individual sea urchins, may be the product of cryptic splice sites within the
185/333 genes. Two possibilities exist: cryptic splice sites within the genes could (i) alter the element patterns of the transcripts by removing whole elements, or (ii) promote trans-gene splicing to form messages that are hybrids of two closely linked genes. It was therefore of interest to investigate putative cryptic splice sites within the
185/333 genes to determine whether alternative or trans-gene splicing could further increase the diversity of the
185/333 transcripts.
The results presented here test a donor splice site model that was generated using known sea urchin splice sites and is called the
Purpuratus splice site model. It incorporates non-adjacent dependencies among positions within the donor splice site [
28] and uses position weight matrices to assess the probabilities of each nucleotide (nt) in each splice site position [
29]. This model predicts splice sites in sea urchin sequences more accurately than a similar model constructed using vertebrate sequences [
28]. The
Purpuratus model also outperformed the Vertebrate model in predicting splice sites in protostome sequences. Although putative donor cryptic splice sites were identified within the
185/333 gene sequences, no transcripts have been identified that utilize these splice sites either to delete elements or to promote trans-gene splicing.
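The position weight matrix component of such a model can be sketched as follows. This is a minimal illustration rather than the Purpuratus model itself. It omits the non-adjacent dependencies among positions that the model incorporates. The training sites are a hypothetical toy set, and the pseudocount, the uniform background, and all function names are assumptions made for the example.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0):
    """Estimate per-position nucleotide probabilities from aligned, equal-length sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all training sites must have the same length")
    pwm = []
    for pos in range(length):
        # The pseudocount keeps unseen nucleotides from getting probability zero.
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({base: counts[base] / total for base in BASES})
    return pwm


def log_odds_score(pwm, window, background=None):
    """Sum of log2(site probability / background probability) over all positions."""
    if background is None:
        background = {base: 0.25 for base in BASES}  # assumed uniform background
    return sum(math.log2(pwm[i][base] / background[base])
               for i, base in enumerate(window))


def scan(pwm, seq):
    """Score every window of the sequence; returns (start index, score) pairs."""
    width = len(pwm)
    return [(i, log_odds_score(pwm, seq[i:i + width]))
            for i in range(len(seq) - width + 1)]


# Hypothetical 9-nt donor sites (3 exonic + 6 intronic positions), for illustration only.
toy_sites = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGA", "TAGGTAAGT", "CAGGTGAGG"]
pwm = build_pwm(toy_sites)

sequence = "TTCAGGTAAGTCCGTTACA"
best_pos, best_score = max(scan(pwm, sequence), key=lambda hit: hit[1])
print(f"best candidate donor window starts at index {best_pos}")
```

A matrix of this kind treats each position independently, so it cannot capture the case in which the nucleotide at one position changes the preferred nucleotide at another. Modeling those dependencies is the motivation for the non-adjacent dependency component described above.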