Nucleic Acids Res. 2011 January; 39(Database issue): D601–D605.
Published online 2010 November 25. doi:  10.1093/nar/gkq1198
PMCID: PMC3013739

PSSRdb: a relational database of polymorphic simple sequence repeats extracted from prokaryotic genomes

Abstract

PSSRdb (Polymorphic Simple Sequence Repeats database) (http://www.cdfd.org.in/PSSRdb/) is a relational database of polymorphic simple sequence repeats (PSSRs) extracted from 85 different species of prokaryotes. Simple sequence repeats (SSRs) are tandem repeats of nucleotide motifs 1–6 bp in size and are highly polymorphic. SSR mutations in and around coding regions affect transcription and translation of genes. Such changes underpin the phase variation and antigenic variation seen in some bacteria. Although SSR-mediated phase variation and antigenic variation have been well studied in some bacteria, many other prokaryotic species have yet to be investigated for SSR-mediated adaptive and other evolutionary advantages. As part of our ongoing studies on SSR polymorphism in prokaryotes, we compared the genome sequences of the various strains and isolates available for 85 different species of prokaryotes, extracted the SSRs showing length variation and compiled them in a relational database called PSSRdb. The database provides information such as the location of PSSRs in genomes, their length variation across genomes and the regions harboring them. This information should be useful for further research on and analysis of SSRs in prokaryotes.

INTRODUCTION

Simple sequence repeats (SSRs), also known as microsatellites, are repetitive nucleotide sequences ubiquitously present in all known genomes (1–9). These sequences characteristically comprise mono- to hexanucleotide repeats arranged in tandem. SSRs undergo high rates of insertion and deletion (INDEL) mutations of their repeat units as a consequence of slipped mispairing of the nascent and template strands during replication, and hence exhibit high polymorphism (10,11). INDEL mutations of repeat units in SSRs occur at high frequencies, ranging from 10⁻⁶ to 10⁻² per generation, which is much higher than base substitution rates (6,11–13). Mutations in SSRs have different effects depending on the location of the SSRs relative to the organization of genes (6,14). SSRs located far from coding regions may evolve neutrally and have no effect on the structure and function of genes. On the other hand, mutations of SSRs either in coding regions or near the regulatory regions of genes could considerably affect the translation or transcription of genes. Furthermore, the severity of the effect in coding regions depends on the repeat type and the repeat location (11). Polymorphic SSRs with a repeating motif length of 3 or 6 nt in the coding regions of a genome cause in-frame mutations, which translate into the insertion or deletion of amino acid residues, whereas polymorphic SSRs of non-triplet repeats (mono-, di-, tetra- and penta-nucleotide) cause frame-shift mutations.
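
The relationship between motif length and reading frame can be made concrete with a small sketch. The Python function below is illustrative only and is not part of PSSRdb; the name `indel_effect` and its arguments are assumptions. It classifies a copy-number change in a coding SSR by the net number of bases gained or lost, which for a change of a single repeat unit reproduces the rule stated above (motifs of 3 or 6 nt stay in frame, all others shift the frame).

```python
def indel_effect(motif_length: int, copy_change: int) -> str:
    """Classify the effect of a repeat-unit INDEL inside a coding region.

    motif_length: size of the repeat motif in nt (1-6 for SSRs).
    copy_change:  repeat units gained (+) or lost (-); hypothetical input.
    """
    if not 1 <= motif_length <= 6:
        raise ValueError("SSR motifs are 1-6 nt long")
    if copy_change == 0:
        raise ValueError("copy_change must be non-zero")
    # Net number of bases inserted or deleted by the change in copy number.
    net_bases = motif_length * copy_change
    # A multiple of 3 adds or removes whole codons; anything else shifts the frame.
    return "in-frame" if net_bases % 3 == 0 else "frame-shift"


if __name__ == "__main__":
    # Effect of gaining one repeat unit for each SSR motif size.
    for motif_length in range(1, 7):
        print(motif_length, indel_effect(motif_length, 1))
```

Because the sketch simply counts bases, a non-triplet motif is also reported as in-frame when it gains or loses a multiple of three units (for example, a dinucleotide repeat changing by three copies).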

The abundance and length distribution of SSRs in genomes give the impression that SSRs are suppressed in prokaryotic genomes as compared to eukaryotic genomes (9). Nonetheless, some SSRs do show polymorphism, and such SSRs are known to confer beneficial effects on prokaryotes [reviewed in (6,8,14)]. The best-documented effects are SSR-mediated phase variation and antigenic variation, which many pathogens exploit to evade the challenges posed by host immune systems and which have been studied in some bacteria (15).

Our group has been analyzing polymorphic SSRs in known prokaryotic genomes and trying to understand evolution of pathogens mediated by SSRs. During the course of our studies, we identified and extracted SSRs which show length variation among different strains and isolates available for 85 different prokaryotic species. All the data pertaining to these polymorphic SSRs (PSSRs) have further been compiled in the form of a relational database called PSSRdb. The present communication gives the details of this database.

EXTRACTION OF THE DATA PERTAINING TO PSSRS

The complete genome sequences of species represented by a minimum of two strains were downloaded from NCBI (ftp://ftp.ncbi.nih.gov/genomes/Bacteria/). PSSRs were extracted with an in-house tool called PSSRFinder (Kumar, P. and Nagarajaram, H.A., unpublished data), whose workflow is shown in Figure 1. Essentially, PSSRFinder runs BLASTN (16) to identify equivalent SSRs (SSRs having very similar or identical flanking sequences at least 50 bp long) among all the genomes available for a species. Some essential details of the method are given below:

  1. Identification of SSRs from the given genomes using SSRF (17), which reports the SSR motif, the motif repeat count, the co-ordinates of the SSR tract in the genome and its location relative to coding and non-coding regions.
  2. Identification of equivalent SSRs, along with their conserved flanking segments, among the various strains and isolates by BLASTN searches with the following parameters: E-value ≤10⁻²⁰; X drop-off value for final gapped alignment = 1000; and repeat masking filter = off.
  3. Identification of PSSRs by comparing the tract lengths of equivalent SSRs found in all the given genomes. If the equivalent polymorphic SSRs lie in non-coding regions in all the genomes, the PSSR is annotated as a non-coding PSSR. If it lies in a coding region in even one of the genomes, it is referred to as a coding PSSR.
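
The decision made in step 3 can be sketched as follows. This is a minimal Python illustration, not the PSSRFinder code (which is unpublished); the `SsrObservation` record and the function `classify_equivalent_ssrs` are hypothetical names, and the equivalent SSRs are assumed to have already been grouped by the BLASTN step.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SsrObservation:
    """One copy of an equivalent SSR as seen in one genome (hypothetical record)."""
    strain: str
    tract_length: int        # length of the SSR tract in bp
    in_coding_region: bool   # True if the tract lies inside a coding region


def classify_equivalent_ssrs(observations: Sequence[SsrObservation]) -> str:
    """Label a group of equivalent SSRs taken from two or more genomes."""
    if len(observations) < 2:
        raise ValueError("at least two genomes are needed for a comparison")
    # The SSR is polymorphic only if the tract length differs between genomes.
    if len({obs.tract_length for obs in observations}) == 1:
        return "not polymorphic"
    # Coding in even one genome makes it a coding PSSR.
    if any(obs.in_coding_region for obs in observations):
        return "coding PSSR"
    return "non-coding PSSR"


if __name__ == "__main__":
    group = [
        SsrObservation("strain_A", 12, False),
        SsrObservation("strain_B", 15, True),
    ]
    print(classify_equivalent_ssrs(group))
```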

Figure 1.
Schematic representation of PSSRFinder. C_PSSRF and NC_PSSRF are the two PERL programs which parse coding and non-coding PSSRs respectively from the BLAST output.

STRUCTURE OF THE DATABASE

PSSRdb has been developed using MySQL (www.mysql.com). PSSRs found in coding and non-coding regions are stored separately in two logically connected databases. Both the coding and non-coding databases contain 357 tables, each of which holds information pertaining to PSSRs, viz. motif types, repeat copy numbers of SSRs, genomic location of SSRs and information on the coding regions harboring or flanking the PSSRs. The structure of the relational tables in the coding and non-coding PSSR databases is given in Tables 1 and 2, respectively.

Table 1.
Structure of MySQL table which is used for storing coding PSSR information
Table 2.
Structure of MySQL table which is used for storing non-coding PSSR information

OVERVIEW OF THE DATABASE AND ITS USAGE FOR DATA EXTRACTION

An overview of the database is shown in Figure 2. The main page contains a pull-down menu listing the names of all 85 species. Once a species is selected, the page is updated with the list of all available strains belonging to that species. One can select two or more of the listed strains to query for PSSRs found in the selected set of strains. Separate options are provided to query for PSSRs found in coding regions and in non-coding regions. A query leads to a page which gives the number of PSSRs found in the selected species. The numbers are clickable links which, when clicked, display pages containing detailed information on the corresponding PSSRs. The displayed information includes the sequence of the repeat motif, its genomic location and the details of the regions harboring that repeat motif. On this page, each listed PSSR is also hyperlinked to PRIMER3 (14) for primer design. The coding regions harboring or flanking the PSSRs are also hyperlinked to their respective annotations available at the NCBI site (http://www.ncbi.nlm.nih.gov/).
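
To illustrate the kind of query described above, the sketch below counts the SSRs whose copy number differs among a user-selected set of strains. It is only a sketch under stated assumptions: the table `coding_pssr`, its columns, the strain names and the demo rows are all hypothetical, and Python's built-in `sqlite3` module stands in for MySQL so that the example is self-contained. The actual table structure of PSSRdb may differ.

```python
import sqlite3
from typing import Sequence


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database holding invented demonstration rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE coding_pssr (
               pssr_id INTEGER,   -- identifier shared by equivalent SSRs
               strain  TEXT,
               motif   TEXT,
               copies  INTEGER,   -- repeat copy number in this strain
               start   INTEGER,   -- genomic start co-ordinate
               gene    TEXT       -- coding region harboring the SSR
           )"""
    )
    rows = [
        (1, "strain_A", "GCC", 5, 1200, "gene_1"),
        (1, "strain_B", "GCC", 7, 1210, "gene_1"),
        (1, "strain_C", "GCC", 5, 1195, "gene_1"),
        (2, "strain_A", "AT", 4, 8800, "gene_2"),
        (2, "strain_B", "AT", 4, 8790, "gene_2"),
        (2, "strain_C", "AT", 6, 8805, "gene_2"),
    ]
    conn.executemany("INSERT INTO coding_pssr VALUES (?, ?, ?, ?, ?, ?)", rows)
    return conn


def count_polymorphic(conn: sqlite3.Connection, strains: Sequence[str]) -> int:
    """Count SSRs whose copy number varies among the selected strains."""
    if len(strains) < 2:
        raise ValueError("select two or more strains")
    placeholders = ",".join("?" for _ in strains)
    sql = f"""SELECT COUNT(*) FROM (
                  SELECT pssr_id FROM coding_pssr
                  WHERE strain IN ({placeholders})
                  GROUP BY pssr_id
                  HAVING COUNT(DISTINCT copies) > 1
              ) AS polymorphic"""
    return conn.execute(sql, list(strains)).fetchone()[0]


if __name__ == "__main__":
    db = build_demo_db()
    print(count_polymorphic(db, ["strain_A", "strain_B", "strain_C"]))
```

Strain names are passed as bound parameters rather than formatted into the SQL string, which keeps the query safe when the selection comes from a web form.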

Figure 2.
Overview of PSSRdb shown using screen-shots of various pages. (A) Main page containing species name which can be selected; (B) PSSRs found in the selected species; (C) Table containing the useful details of the selected coding PSSRs found in the selected ...

As mentioned earlier, the PSSRs stored in PSSRdb have been identified species-wise and correspond to those SSRs which show length variation among the different strains and isolates available for each of the 85 species. In this respect, we would like to sound a word of caution. Although all the prokaryotic genomes have >10× coverage, sequencing or assembly mistakes cannot be completely ruled out. Some SSRs may qualify as PSSRs as a consequence of sequencing errors or of mistakes made during the assembly of genome sequences. It is very difficult to identify such artifacts. Nonetheless, we believe the data in PSSRdb make a good starting point for further exploratory investigations of SSR polymorphism in prokaryotes.

Identifying PSSRs in a species has practical value, and the potential applications depend on the region in which a PSSR occurs. A strain-specific PSSR (one whose SSR length varies in only one strain) could be used to identify that strain and is of importance in making diagnostic kits. Genes harboring PSSRs are good candidates for studying the functional role of genes in pathogenesis and virulence.
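
The notion of a strain-specific PSSR can be expressed as a simple test on tract lengths. The Python sketch below is an illustration under assumptions, not a feature of PSSRdb: the function name `strain_specific` and the input mapping are hypothetical, and at least three strains are required so that "only one strain differs" is unambiguous.

```python
from collections import Counter
from typing import Dict, Optional


def strain_specific(tract_lengths: Dict[str, int]) -> Optional[str]:
    """Return the single strain whose tract length differs, or None.

    tract_lengths maps strain name -> tract length (bp) of one equivalent SSR.
    """
    counts = Counter(tract_lengths.values())
    # Need three or more strains and exactly two distinct tract lengths.
    if len(tract_lengths) < 3 or len(counts) != 2:
        return None
    # most_common() orders the two lengths from most to least frequent.
    (_, _), (rare_length, rare_count) = counts.most_common()
    if rare_count != 1:
        return None
    # Exactly one strain carries the rare length; report it.
    return next(s for s, n in tract_lengths.items() if n == rare_length)


if __name__ == "__main__":
    print(strain_specific({"strain_A": 12, "strain_B": 12, "strain_C": 15}))
```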

FUTURE DIRECTION

A hyperlink will be provided to query for the multiple sequence alignment of PSSRs along with their flanking regions, so that users can select the number of base pairs of upstream and downstream sequence and perform the multiple sequence alignment on the fly. The database will be updated regularly as and when whole genome sequences of new prokaryotes become available.

FUNDING

The work as well as the publication costs were supported by the Core fund of Centre for DNA Fingerprinting and Diagnostics (CDFD).

Conflict of interest statement. None declared.

ACKNOWLEDGEMENTS

P.K. acknowledges a Senior Research Fellowship (SRF) from the Council of Scientific and Industrial Research (CSIR), India.

REFERENCES

1. Schlotterer C, Tautz D. Slippage synthesis of simple sequence DNA. Nucleic Acids Res. 1992;20:211–215.
2. Tautz D. Notes on the definition and nomenclature of tandemly repetitive DNA sequences. EXS. 1993;67:21–28.
3. Moxon ER, Rainey PB, Nowak MA, Lenski RE. Adaptive evolution of highly mutable loci in pathogenic bacteria. Curr. Biol. 1994;4:24–33.
4. Tautz D, Schlotterer C. Simple sequences. Curr. Opin. Genet. Dev. 1994;4:832–837.
5. Schlotterer C. Genome evolution: are microsatellites really simple sequences? Curr. Biol. 1998;8:R132–R134.
6. van Belkum A, Scherer S, van Alphen L, Verbrugh H. Short-sequence DNA repeats in prokaryotic genomes. Microbiol. Mol. Biol. Rev. 1998;62:275–293.
7. Buschiazzo E, Gemmell NJ. The rise, fall and renaissance of microsatellites in eukaryotic genomes. Bioessays. 2006;28:1040–1050.
8. Moxon R, Bayliss C, Hood D. Bacterial contingency loci: the role of simple sequence DNA repeats in bacterial adaptation. Annu. Rev. Genet. 2006;40:307–333.
9. Mrazek J, Guo X, Shah A. Simple sequence repeats in prokaryotic genomes. Proc. Natl Acad. Sci. USA. 2007;104:8472–8477.
10. Levinson G, Gutman GA. Slipped-strand mispairing: a major mechanism for DNA sequence evolution. Mol. Biol. Evol. 1987;4:203–221.
11. Sreenu VB, Kumar P, Nagaraju J, Nagarajaram HA. Microsatellite polymorphism across the M. tuberculosis and M. bovis genomes: implications on genome evolution and plasticity. BMC Genomics. 2006;7:78.
12. Garcia-Diaz M, Kunkel TA. Mechanism of a genetic glissando: structural biology of indel mutations. Trends Biochem. Sci. 2006;31:206–214.
13. Kunkel TA. DNA replication fidelity. J. Biol. Chem. 2004;279:16895–16898.
14. van der Woude MW, Baumler AJ. Phase and antigenic variation in bacteria. Clin. Microbiol. Rev. 2004;17:581–611.
15. Brunham RC, Plummer FA, Stephens RS. Bacterial antigenic variation, host immune response, and pathogen-host coevolution. Infect. Immun. 1993;61:2273–2276.
16. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402.
17. Sreenu VB, Ranjitkumar G, Swaminathan S, Priya S, Bose B, Pavan MN, Thanu G, Nagaraju J, Nagarajaram HA. MICAS: a fully automated web server for microsatellite extraction and analysis from prokaryote and viral genomic sequences. Appl. Bioinformatics. 2003;2:165–168.

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press