The Ciona intestinalis protein database (CIPRO) is an integrated protein database for the tunicate species C. intestinalis. The database is unique in two respects: first, because of its phylogenetic position, Ciona is a suitable model for understanding vertebrate evolution; and second, the database includes original large-scale transcriptomic and proteomic data. Ciona intestinalis has also long been a favorite of developmental biologists. Therefore, large amounts of data exist on its development and morphology, along with a recent genome sequence and gene expression data. The CIPRO database aims to collect those published data and to provide unique information from unpublished experimental data, such as 3D expression profiling, 2D-PAGE and mass spectrometry-based large-scale analyses at various developmental stages, curated annotation data and various bioinformatic data, to facilitate research in diverse areas, including developmental, comparative and evolutionary biology. For medical and evolutionary research, homologs in humans and major model organisms are intentionally included. The current database is based on a recently developed KH model containing 36 034 unique sequences, but for higher usability it covers all 89 683 known and predicted proteins from all gene models for this species. Of these sequences, more than 10 000 proteins have been manually annotated. Furthermore, to establish a community-supported protein database, these annotations are open to evaluation by users through the CIPRO website. CIPRO 2.5 is freely accessible at http://cipro.ibio.jp/2.5.
The marine tunicate species Ciona intestinalis (Urochordata) has been an attractive research organism for developmental biology for more than a century (1,2). Because its transparent body and small number of constituent cells allow easy observation of its development, C. intestinalis has become one of the favorites of developmental biologists, and a large body of knowledge has accumulated about its development and morphology (3–6). To this, recent progress has added the genome sequence and gene expression data (7–14). Furthermore, genome projects of related species have revealed that Ciona is closer to vertebrates than the cephalochordates, such as amphioxus, which had been thought to be the closer relatives based on morphology (10). Therefore, C. intestinalis has turned out to be not only a good model organism for developmental biology, but also one of the most important species for understanding the origin and evolution of vertebrates.
Here we introduce the Ciona intestinalis protein database (CIPRO), an integrated comprehensive proteome database for this tunicate species. CIPRO is based on recently published, reliable gene models supplemented with data from other databases and also includes original experimental data, such as 2D-PAGE images combined with proteomic analyses (15,16) and 3D expression data (17) at various developmental stages and in adult tissues. In addition to providing these unpublished transcriptomic and proteomic data, CIPRO offers automated annotation of its gene models based on bioinformatic data. Of these, more than 10 000 proteins have been further supplied with manually curated annotation based on expression data and biochemical and physiological knowledge. Because of the unique evolutionary position of the species and its simple body plan, the database should provide useful information not only to tunicate researchers, but also to researchers in fields such as developmental biology, evolutionary biology and medicine. Over the past 5 years, advances in comparative genomics have led to the sequencing of genomes of other marine invertebrates, including the sea urchin (18), sea anemone (19) and amphioxus (10). CIPRO is the first integrated protein database for a marine invertebrate and could therefore serve as a model for future marine invertebrate protein databases. In addition, molecular data related to homologs in humans and major model organisms have been included intentionally to facilitate medical and evolutionary research. The current CIPRO database is based on a recently developed KH model containing 36 034 unique sequences (11). However, it covers all 89 683 known and predicted proteins from all gene models for this species, to achieve higher usability for researchers who work with different gene models (Figure 1).
All of these sequences have been automatically annotated, and more than 10 000 have been manually annotated based on large-scale transcriptomic, proteomic and bioinformatic data. In addition, we have included bioinformatic data such as 3D structural models and sequence homology data to facilitate protein comparison, as well as information about chemicals and potential antibodies that target C. intestinalis proteins. Finally, the CIPRO database website incorporates a functionality that enables the research community to evaluate and/or edit this information with ratings, curation and comments.
The CIPRO database has several unique features that reflect both the evolutionary position of the organism and the experimental omics data collected for the database. First, several bioinformatic analyses and tools have been used to highlight the relationship between C. intestinalis and other organisms, with special emphasis on humans. Sequence homology analysis based on BLAST, links to OMIM and other databases, and other bioinformatic analyses including 3D structural modeling results are presented for each protein entry. Second, omics data that include transcriptomic analyses such as EST analysis and DNA microarray data, proteomic data obtained with 2D-PAGE and large-scale LC/MS analyses, and 3D expression data, have been collected and presented with emphasis on developmental changes and distribution in adult tissues. Third, every sequence entry has been automatically annotated based on sequence homology and the presence of known functional domains; additionally, parts of the entries have been further annotated manually based on bioinformatic data, expression data and existing biochemical and physiological knowledge. Fourth, all of the data can be accessed via an original user-friendly interface that includes an extra capability for evaluation and refinement by community-wide users. Both registered and anonymous users can not only access all the data contained, but also evaluate and/or revise the contents to refine the whole database. We discuss the features in detail in the following sections.
The database is built primarily from the protein sequences of our recently developed KH model, which contains 36 034 unique sequences (11). To achieve high usability, however, it includes a total of 89 683 non-redundant sequences derived from all gene models available for C. intestinalis, as discussed in detail below. As shown in Figure 1, it contains five publicly available proteomes of C. intestinalis plus the PROCITS data set (20). Although this number is still too large for a proteome compared with other species, we have chosen to retain all the sequence entries for the following reasons. First, those proteomes share a relatively small number of sequences, so the combined set can hardly be reduced (Figure 1). Second, even when the sequences are clustered by the BLASTClust program so that only distinct sequences are included, the number of entries is still 70 493, or 79% of the combined unique sequence set. Third, each gene model bears a unique identifier, and many of these identifiers are used in independently published experimental results. The original identifiers are therefore convenient as references in many cases. Finally, none of the gene models is perfect, and we are still in the process of sorting out the entries based on manual annotation. We also expect that the capability for user-based annotation and refinement will facilitate this process by filtering out some unrealistic entries. The large number of entries may reflect the uncertainty in gene prediction in the genome of this species. In addition, it might partly be explained by the existence of trans-splicing, which is not common in other model organisms. To examine these possibilities, the integrated data representation in CIPRO, including comparative data, should be helpful. In particular, comparison with the proteome of C. savignyi, the closest relative whose whole genome sequence has been determined, should help outline the true proteome of C. intestinalis. We therefore included the BLAST results against the C. savignyi proteome.
The target proteome is composed of known, novel and ab initio predicted peptides, which are distinguishable by identifier and remark.
Individual protein data derived from bench experiments and bioinformatics analyses are presented in a single panel in CIPRO, as shown in Figure 2. The left side of the panel shows the basic text information, including sequence, length, deduced molecular weight, isoelectric point, summary of homology search results, domain search result, gene ontology (21), disease information for human homolog, cross references, automatic annotation, link to simple phylogeny, assignment results to the KEGG OC clusters and duplicate sequences. The right side shows the experimental results and a graphical representation of the results of bioinformatics analyses. The experimental data include 2D-PAGE images with identified spots, photographs of the cellular localization and a complex chart of mRNA and protein expression profiles based on EST, microarray and 2D-gel data. The bioinformatics analyses include cytolocalization, predicted 3D structure, predicted secondary structure with modification sites, chromosomal location of human homologs and a summary of BLAST hits. In addition to this information, a user comment section is provided so that content enrichment is possible without remodeling the system. Details for each item are described below.
The photo images of 2D-PAGE gels with the highlighted spots for the protein of interest are shown in the right panel of Figure 2. There may be more than one spot for the corresponding protein, suggesting possible modification or processing of the protein. We have a separate page for 2D-PAGE analysis that includes all the identified protein spots in 2D-PAGE images with quantitative data. Comparison with other developmental stages or adult tissues is also possible.
The original experimental data also include 3D protein localization (3DPL). Spatiotemporal localization images of each protein were obtained by immunolocalization and GFP-fusion protein expression [Figure 2(7)]. The 3DPL data and related information (cellular localization, staining method, developmental stage, experimental condition, corresponding articles, etc.) are linked to the information for the corresponding developmental stage of the C. intestinalis embryo (17). These data help users to understand the cellular and developmental functions of each protein and can also be used as control data for comparing phenotypes among mutants in knockdown or overexpression experiments.
The graph labeled ‘Expression Profile’ is a summary of gene expression data from EST, microarray and 2D-PAGE protein quantification data. The raw value is displayed by mousing over the graph. Because all of these data are summarized in a single chart, differences in expression between mRNA and protein are easily observable, though they may also reflect experimental fluctuations. Note that because the data are based on real laboratory experiments, some columns or categories may be missing where no data were observed or obtained.
We built the database on our recently developed KH model containing 36 034 unique sequences (11). However, several sets of gene models exist for C. intestinalis, as mentioned above. We therefore incorporated all the protein models available to date, including those from Kyoto Gene (KG) (22), KH (successor of KG, http://ghost.zool.kyoto-u.ac.jp/indexr1.html) (11), PROCITS (20), JGI versions 1 and 2 (http://genome.jgi-psf.org/ciona4/ciona4.home.html, http://www.broad.mit.edu/annotation/ciona/) and Ensembl (version 58.2). Identical sequences across gene models were unified to produce a total of 89 683 protein entries. The entries are accessible by all names and accession numbers used in the original gene models. Automated annotation was applied to these entries according to the criteria in Table 1 and as shown in Figure 2(5).
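The unification of identical sequences across gene models can be illustrated with a minimal sketch. It is not the CIPRO pipeline itself: the function name, the data layout and the identifiers in the usage example are hypothetical, and it assumes that "identical" means an exact match of the amino acid string.

```python
from collections import defaultdict


def unify_identical(models):
    """Merge identical sequences from several gene models into single entries.

    `models` maps a gene-model name to a dict of {identifier: sequence}.
    Returns (entries, index): `entries` holds one record per distinct
    sequence, and `index` maps every original identifier to its entry
    position, so each entry stays reachable by all of its original names.
    """
    ids_by_sequence = defaultdict(list)
    for model_name, proteins in models.items():
        for identifier, sequence in proteins.items():
            ids_by_sequence[sequence].append((model_name, identifier))

    entries = [
        {"sequence": sequence, "identifiers": identifiers}
        for sequence, identifiers in ids_by_sequence.items()
    ]
    index = {
        identifier: position
        for position, entry in enumerate(entries)
        for _, identifier in entry["identifiers"]
    }
    return entries, index


# Hypothetical identifiers and toy sequences, for illustration only.
toy_models = {
    "modelA": {"A.0001": "MKTAYIAK", "A.0002": "MSSLQ"},
    "modelB": {"B.0001": "MKTAYIAK"},
}
toy_entries, toy_index = unify_identical(toy_models)
```

In this toy input, `A.0001` and `B.0001` share a sequence and therefore resolve to the same entry, while both identifiers remain valid lookup keys.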
The amino acid sequences in the CIPRO database are derived from all C. intestinalis gene models available as of April 2010. To maintain consistency and avoid confusion, the original identifiers for all gene models have been retained, with the exception of the KG2005 gene models, for which the prefix ‘CIPRO’ is used instead of ‘KG2005’. In some cases, the original gene models contain genes with more than one coding sequence (i.e. sequences separated by stop codons). In the present CIPRO database, these are treated as separate sequences and marked with numerical suffixes (e.g. .1, .2 and so on). For consistency’s sake, the entire sequence from the original gene model, including stop codons, is indicated with the suffix dot-zero (.0).
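The suffix scheme can be sketched as follows. This is an illustration under stated assumptions, not the code used to build CIPRO: the function name is hypothetical, stop codons are assumed to be written as `*` in the translated sequence, and an entry with a single coding sequence is assumed to keep its identifier without a suffix.

```python
def split_coding_sequences(identifier, translated, stop="*"):
    """Apply the dot-suffix scheme to one translated gene model.

    If the translation contains more than one coding sequence (segments
    separated by stop codons), return the full sequence as `<id>.0`
    followed by each segment as `<id>.1`, `<id>.2`, and so on.
    Otherwise return the entry unchanged under its original identifier.
    """
    # Empty pieces arise from leading, trailing or consecutive stops.
    segments = [segment for segment in translated.split(stop) if segment]
    if len(segments) < 2:
        return [(identifier, translated)]

    result = [(f"{identifier}.0", translated)]
    for number, segment in enumerate(segments, start=1):
        result.append((f"{identifier}.{number}", segment))
    return result
```

For a hypothetical identifier `X` with the translation `MKT*MAA`, the function returns `X.0` for the full sequence and `X.1` and `X.2` for the two segments.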
Figure 2 shows a screenshot of a typical protein entry. The top of the left panel [Figure 2 (2)] shows the protein information, including deduced amino acid sequence, length, calculated molecular weight, isoelectric point and protein name candidates. A link to the NCBI BLAST server is also provided with the sequence field already filled in, so that users can execute their own homology search [Figure 2 (3)].
The bottom-right corner of the panel [Figure 2(8)] shows the top hits from homology searches for each selected model organism, making the protein names in each species easily recognizable. A histogram of BLAST hits is also shown on the right panel to allow for the identification of potential protein families.
The ‘OMIM’ tag shows the information for human homologs and associated disease information with a direct link to the corresponding NCBI webpage [Figure 2 (4)]. Where available, the loci of the human homologs are also shown graphically on the right panel.
A summary of the domains and motifs identified by InterProScan 4.5 (InterPro version 22.0) (23) is shown with the corresponding InterPro identifiers and definitions based on information from PFAM (24), GO, PROSITE (25), PANTHER (26) and SUPERFAMILY (27) [Figure 2(4)]. These categories are also used in the automated annotation. Any identified domains and motifs are also indicated in the box labeled ‘psipred’ in the right panel.
Automated annotation was performed using the H-Invitational database scheme (28,29), except that the criteria were modified as shown in Table 1 and Figure 2 (4). For cases in which more than one reference source was available, the top-most category was applied. For example, if a protein was similar to a predicted protein and contained a motif, it was classified as category III.
A phylogeny with a limited number of homologs, the assignment to a KEGG ortholog cluster (KEGG OC, ftp://ftp.genome.jp/pub/kegg/genes/oc/oc.gz) identified using KAAS (30) and putative duplicated genes are provided as links so that users can explore further biological implications [Figure 2 (5)].
One of the unique features of CIPRO is its graphical view of results from bioinformatics analyses [Figure 2 (9)]. Each icon-like picture summarizes a separate bioinformatics analysis to allow for an easy grasp of the protein character at a glance. Each component is described separately below.
The subcellular localization predicted by WoLF PSORT (31) is shown graphically as the color intensity of each organelle or cellular compartment. We developed this graphical representation ourselves. The more intense the color of a cellular part, the more probable it is that the protein is localized in that particular compartment. Some proteins are predicted to be localized to multiple compartments.
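One simple way to turn per-compartment prediction scores into color intensities is to scale each score by the total. The sketch below is only an assumption about how such a mapping could work; the text does not describe the actual CIPRO mapping, and the function name, the compartment labels and the input format (a dictionary of non-negative scores) are hypothetical.

```python
def localization_intensities(scores):
    """Scale non-negative compartment scores to intensities in [0, 1].

    Each intensity is the compartment's share of the total score, so a
    protein predicted for several compartments gets several partly
    colored compartments rather than a single fully colored one.
    """
    total = sum(scores.values())
    if total <= 0:
        return {compartment: 0.0 for compartment in scores}
    return {compartment: score / total for compartment, score in scores.items()}
```

With hypothetical scores of 3 for the nucleus and 1 for the cytosol, the nucleus would be drawn at three quarters of full intensity and the cytosol at one quarter.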
Predictions of plasma membrane localization and transmembrane components by TMHMM 2.0c (32,33) are shown graphically using our original software tool. This feature can be used together with other annotations (including cytolocalization and text annotation) to identify protein functions, such as those of cytokine receptors and cell adhesion molecules.
3D structures modeled by Modeller 9v7 (34–36) are also presented in the graphical view. Clicking on the picture opens a Jmol (http://www.jmol.org/) window, allowing the user to manipulate the picture for rotation and magnification, change color to emphasize specific atoms or residues, etc.
The secondary structure, possible modification sites, and domains and motifs predicted by Psipred, Netphos and InterProScan, respectively, are summarized in a single picture labeled ‘psipred’. However, this label is not meant to imply that a single program was used to produce the figure. We developed a new program to generate the summary picture for the current project.
The picture labeled ‘omim’ depicts the chromosomal map location of human homologs of each protein. For multigene families, more than one location may be indicated.
Dr Di Jiang of the Sars International Centre for Marine Molecular Biology, Norway, has generously provided information about commercially available antibodies that have the potential to cross-react to Ciona proteins. This information was primarily obtained by homology searches with known epitopic sequences and does not guarantee that the antibody will cross-react, but it should be useful for experimental design.
To facilitate the improvement of annotation by visiting users, we have implemented a capability for users to input additional annotation and/or comments, which will then be subjected to rating by subsequent users. To aid the curation process, literature information, matched motif patterns and other related protein information are shown with links. To aid in annotation quality control, the annotator can record his/her name with the annotation. As a part of the CIPRO project, the members, mostly experimental biologists, have manually annotated more than 10000 entries to date. During this annotation process, we found the information on specific expression patterns during development to be especially useful.
The search function can be used to find keywords in any field, including protein name, annotator name, the number of annotations, category for the automated annotation, deduced amino acid length, calculated and observed molecular weights, isoelectric point, homolog name with specifiable expectation value threshold, expressed tissue and/or developmental stage and provided data type. The last one is especially useful for finding particular data sets that contain information of interest. The search can also be done with combinations of parameters. BLAST search and fragment mass search functions are also available. Search results are downloadable in CSV format.
Protein names were annotated with the abbreviated name (ANAME), followed by a semicolon and the descriptive name (DNAME) with annotation category, as follows:
ALDH4A1; HOMOLOGOUS TO delta-1-pyrroline-5-carboxylate dehydrogenase, mitochondrial.
For cases in which more than one name exists for a homolog, each name is listed with comma separators. If only DNAMEs are available, a semicolon and a space are placed at the beginning of the line. For partial sequences, a comma and the keyword ‘partial’ are appended. The information sources referred to were checked at the time of annotation. When more than one reference source was available, the topmost category was applied. For example, if a protein was similar to a predicted protein and contained a motif, it was noted as category III. If experimental information was used as evidence, it is noted in the comment field, not in the annotation.
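The naming convention above can be expressed as a short formatting routine. This is a sketch of the stated rules rather than the annotation tool used by the project: the function name and arguments are hypothetical, and it assumes that multiple names are joined with a comma followed by a space and that the annotation category phrase (such as ‘HOMOLOGOUS TO’) is already part of the descriptive name.

```python
def format_protein_name(anames, dname, partial=False):
    """Build a protein name in the 'ANAME; DNAME' convention.

    `anames` is a list of abbreviated names (possibly empty) and `dname`
    is the descriptive name including its annotation category phrase.
    With no abbreviated name, the result starts with a semicolon and a
    space; partial sequences get ', partial' appended.
    """
    name = ", ".join(anames) + "; " + dname
    if partial:
        name += ", partial"
    return name


example_name = format_protein_name(
    ["ALDH4A1"],
    "HOMOLOGOUS TO delta-1-pyrroline-5-carboxylate dehydrogenase, mitochondrial",
)
```

Because joining an empty list yields an empty string, the case with only a DNAME produces the leading semicolon and space without special handling.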
Because a standard nomenclature for C. intestinalis proteins has not yet been proposed, some gene names have a prefix of ‘ci-’ (for C. intestinalis), whereas others do not. Considering the nomenclature consensus that exists for other model organisms, we think it is important to start discussing a standard nomenclature for this species. In this context, we should point out that the CIPRO database will also serve as the thesaurus for Ciona protein names. In the current database, the established names are retained, but the ci- prefix is omitted for new protein names.
Institute for Bioinformatics Research and Development (BIRD); Japan Science and Technology Agency (JST); Ministry of Education, Culture, Sports, Science and Technology, Japan, Grants-in-Aid (2D-PAGE data) (No. 22112004 to K.I., partial). Funding for open access charge: Institute for Bioinformatics Research and Development (BIRD); Japan Science and Technology Agency (JST).
Conflict of interest statement. None declared.
We thank the anonymous reviewer for helpful suggestions. We thank Dr Di Jiang for the antibody data and Dr Kaoru Azumi for the microarray data. We are also grateful to Drs Alu Konno, Zhu Lihong, Akiko Hozumi, Kogiku Shiba and Daisuke Shibata for providing experimental data, Ms Feifei Zhang and Ms Yu Yoshida for documentation, cross-browser usability testing and suggestions for improving the web interface, and Ms Rie Tsuchiya and Ms Rie Oshima for assistance with the official management of the project.