The phpMyAdmin program and the MySQL system were used to construct the bEST-DRRD. The database consists of three tables: the first describes the Arabidopsis genes involved in DNA repair and replication (table name: drrd_arabidopsis), the second describes barley ESTs similar to Arabidopsis genes (table name: est_barley) and the third lists the barley genes cloned in the course of the project (table name: gene_barley). Each table was designed specifically for one of these three groups (Figure ). The table for Arabidopsis genes contains information about gene function, the number of mRNA molecules produced by alternative splicing, all mRNA and coding sequences, the amino-acid sequences and lengths of the proteins, and the accession numbers of all these entries in the NCBI database. The table for ESTs contains the source of each EST, the EST sequence, the strand orientation of the alignment, the identities (similarity shared with the query), the Expect value, and the start and stop positions of the alignment between the Arabidopsis sequence (query) and the barley EST (subject). The table for the barley genes identified and cloned in the project provides genomic, coding and amino-acid sequences, together with their NCBI GenBank accession numbers, the ESTs used for gene identification and the primers used to clone each gene. All the tables are linked by dedicated key entries that identify a single row in each table and connect it with the related rows throughout the bEST-DRRD. This solution allows flexible searching of the database content and simultaneous browsing of different tables (Figure ). Designing a separate table for each data collection made it fast and easy to modify the database structure and to add new columns to a table.
Figure 1 The structure of the bEST-DRRD with the content of each component. Arrows denote the direction from input sequence/information to the outputs. Asterisks indicate that the query sequence may be selected at various levels of database browsing, because Arabidopsis ...
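The three-table layout described above can be sketched as follows. This is a minimal illustration only: it uses Python's built-in sqlite3 module as a stand-in for MySQL, the table names come from the text, but every column name (ath_id, est_name, expect_value and so on) and every inserted value is a hypothetical placeholder, not the actual bEST-DRRD schema or data.

```python
import sqlite3

# Table names follow the text; all column names are assumptions made
# for illustration and do not reproduce the real bEST-DRRD schema.
SCHEMA = """
CREATE TABLE drrd_arabidopsis (
    ath_id        INTEGER PRIMARY KEY,   -- hypothetical key column
    gene_name     TEXT NOT NULL,
    gene_function TEXT
);
CREATE TABLE est_barley (
    est_id       INTEGER PRIMARY KEY,
    ath_id       INTEGER NOT NULL REFERENCES drrd_arabidopsis(ath_id),
    est_name     TEXT NOT NULL,
    expect_value REAL
);
CREATE TABLE gene_barley (
    gene_id   INTEGER PRIMARY KEY,
    ath_id    INTEGER NOT NULL REFERENCES drrd_arabidopsis(ath_id),
    gene_name TEXT NOT NULL
);
"""


def build_demo_db():
    """Create an in-memory database filled with placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO drrd_arabidopsis VALUES (1, 'GENE_A', 'placeholder function')"
    )
    conn.executemany(
        "INSERT INTO est_barley VALUES (?, ?, ?, ?)",
        [(1, 1, "EST_1", 1e-30), (2, 1, "EST_2", 1e-5)],
    )
    conn.execute("INSERT INTO gene_barley VALUES (1, 1, 'BARLEY_GENE_A')")
    return conn


def ests_for_gene(conn, gene_name):
    """Return (est_name, expect_value) pairs linked to one Arabidopsis gene."""
    query = """
        SELECT e.est_name, e.expect_value
        FROM drrd_arabidopsis AS a
        JOIN est_barley AS e ON e.ath_id = a.ath_id
        WHERE a.gene_name = ?
        ORDER BY e.expect_value
    """
    return conn.execute(query, (gene_name,)).fetchall()


def cloned_genes_for(conn, gene_name):
    """Return the names of cloned barley genes linked to one Arabidopsis gene."""
    query = """
        SELECT g.gene_name
        FROM drrd_arabidopsis AS a
        JOIN gene_barley AS g ON g.ath_id = a.ath_id
        WHERE a.gene_name = ?
    """
    return [row[0] for row in conn.execute(query, (gene_name,))]


conn = build_demo_db()
```

Because both dependent tables carry the key of the Arabidopsis entry, a single query sequence can be followed to its barley ESTs and to the cloned barley gene without duplicating data, and a new column can be added to one table without touching the others.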
The first step of data gathering was to search the NCBI GenBank database for Arabidopsis sequences and encoded polypeptides known to be involved in DNA replication and repair. The list of genes was compiled on the basis of bioinformatic research and an analysis of literature data. To date, more than 200 Arabidopsis mRNA entries, including alternatively spliced versions of the transcripts, along with the sequences of the encoded polypeptides, have been retrieved from the GenBank database. These sequences are used as the queries for browsing the repositories. The Arabidopsis sequences and encoded polypeptides were collected and categorised in a casual database. DNA replication-related sequences were arranged into ten groups based on the stage of the replication process they regulate: Origin recognition, Replicative helicases, Helicases’ loading factors, Initiation, GINS complex (a novel replication complex, the letters in the acronym stand for Go, Ichi, Nii, and San; five, one, two, and three in Japanese), Elongation, POLD (POLymerase Delta) clamp, PCNA (Proliferating Cell Nuclear Antigen) loading complex, Binding of ssDNA and Maturation. DNA repair and damage tolerance-related sequences were clustered according to the process they participate in: BER (Base Excision Repair), BER-related genes, NER (Nucleotide Excision Repair), MMR (Mismatch Repair), NHEJ (Non-Homologous End Joining), Photoreactivation, Rad6 pathway and damage response, which may be defined as the mechanism that recognises DNA damage and propagates the signal to arrest the cell cycle and allow DNA repair (Table ).
The second part of the bEST-DRRD structure is based on the ViroBLAST platform, which was developed as a sequence alignment web server by Prof. James Mullins and his co-workers at the University of Washington, Seattle, USA [9], and the NCBI C++ toolkit (ver. 2.2.25+). This tool was given access to several data sources in our database: Barley Genome version 0.05, which contains 1 470 315 sequences that cover ca. 90% of the barley genome coding sequence and was downloaded from http://harvest-web.org/utilmenu.wc, and Barley ESTs Assembly 35, which contains 50 937 sequences and was obtained from http://harvest-web.org. Access to both repositories was kindly granted by Prof. Timothy J. Close from the University of California, Riverside, USA. The rest of the repository encompasses the contents of open-access databases of the complete genomic sequences of O. sativa, derived from RGAP 7, the Rice Genome Annotation Project (http://www.rice.tigr.org), and B. distachyon. The repository section dedicated to the rice genome includes more than 55 000 sequences. In the presented database, two components of the B. distachyon data source (downstream 1000 bp) were combined, resulting in more than 62 000 sequences. ViroBLAST implements the NCBI C++ toolkit and may be used for a ‘basic search’ or for an ‘advanced search’, in which the search parameters may be customised.
To establish the bEST-DRRD repository, the Arabidopsis sequences retrieved from NCBI GenBank were used as queries to browse four barley sequence databases: HarvEST, TIGR Plant Transcript Assemblies (http://plantta.jcvi.org), the IPK Crop EST Database (CR-EST) (http://pgrc.ipk-gatersleben.de/cr-est) and the database of the Computational Biology and Functional Genomics Laboratory (Gene Index Project) (http://compbio.dfci.harvard.edu). During EST retrieval from the TIGR and Gene Index Project databases, the BLASTN algorithm was applied with the blosum62 matrix, an Expect value of 10 and the number of alignments set to 20. All the retrieved barley ESTs were annotated, categorised, grouped and ascribed to the query sequence.
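The final step, ascribing retrieved ESTs to their query sequence, might look like the sketch below. It rests on several assumptions that are not stated in the text: that the hits are available in the standard 12-column BLAST tabular format (query, subject, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, E-value, bit score), that the Expect threshold of 10 is applied as a filter, and that the limit of 20 alignments is applied per query. The names Hit, parse_tabular and assign_to_queries are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    """One alignment between an Arabidopsis query and a barley EST."""
    query: str
    subject: str
    identity: float
    q_start: int
    q_end: int
    evalue: float


def parse_tabular(text):
    """Parse tab-separated BLAST hits (12-column layout assumed)."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        f = line.split("\t")
        hits.append(
            Hit(f[0], f[1], float(f[2]), int(f[6]), int(f[7]), float(f[10]))
        )
    return hits


def assign_to_queries(hits, max_evalue=10.0, max_alignments=20):
    """Group hits by query, keeping the best-scoring ones per query.

    The defaults mirror the Expect value (10) and alignment count (20)
    quoted in the text; applying them in this way is an assumption.
    """
    grouped = defaultdict(list)
    for hit in hits:
        if hit.evalue <= max_evalue:
            grouped[hit.query].append(hit)
    return {
        query: sorted(query_hits, key=lambda h: h.evalue)[:max_alignments]
        for query, query_hits in grouped.items()
    }
```

Each value in the returned dictionary then holds the ESTs to store against one Arabidopsis entry, together with the identity, Expect value and alignment coordinates.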