|Home | About | Journals | Submit | Contact Us | Français|
Edited by Katsumi Isono
The cyanobacterium, Synechocystis sp. PCC 6803, was the first photosynthetic organism whose genome sequence was determined in 1996 (Kazusa strain). It thus plays an important role in basic research on the mechanism, evolution, and molecular genetics of the photosynthetic machinery. There are many substrains or laboratory strains derived from the original Berkeley strain including glucose-tolerant (GT) strains. To establish reliable genomic sequence data of this cyanobacterium, we performed resequencing of the genomes of three substrains (GT-I, PCC-P, and PCC-N) and compared the data obtained with those of the original Kazusa strain stored in the public database. We found that each substrain has sequence differences some of which are likely to reflect specific mutations that may contribute to its altered phenotype. Our resequence data of the PCC substrains along with the proposed corrections/refinements of the sequence data for the Kazusa strain and its derivatives are expected to contribute to investigations of the evolutionary events in the photosynthetic and related systems that have occurred in Synechocystis as well as in other cyanobacteria.
Cyanobacteria are capable of oxygenic photosynthesis; they are thought to be the progenitor of plant plastids. Synechocystis sp. PCC 6803 is one of the most widely used cyanobacterial species for genetic studies for several major reasons; (i) it is naturally competent by incorporating exogenous DNA into cells that is integrated into the genome by homologous recombination at high frequency;1–3 (ii) it grows heterotrophically in the presence of glucose;3,4 (iii) the entire genome sequence was determined early on by Kaneko et al.5 The availability of the entire genome sequence facilitated post-genomic investigations such as transcriptome-, proteome-, and functional genomics studies.6
The original strain of Synechocystis was isolated from California freshwater by Kunisawa and colleagues and called the Berkeley strain;7 it was deposited in the Pasteur Culture Collection (PCC strain) and the American Type Culture Collection (ATCC strain). Williams3 subsequently isolated the glucose-tolerant (GT) strain from the ATCC strain.3 The Kazusa strain, whose genome sequence was published in 1996,5 is a derivative of a GT strain. A single representative clone of the GT strain was established for complete genome sequencing as the Kazusa strain; other strains were maintained and transferred without further cloning such as single colony isolation.8 Therefore, four substrains, PCC-, ATCC-, GT-, and Kazusa strains, all derived from the original Berkeley strain, were distributed to a number of laboratories although all of them were grouped together under the name Synechocystis sp. PCC 6803.9 Ikeuchi and Tabata8 reported that each substrain had specific mutations such as single nucleotide polymorphisms (SNPs) and indels, and some exhibited a specific phenotype. Some of the mutated loci that were different from the sequence in the database derived from the Kazusa strain have been identified such as a SNP,10 indels,6,11–13 and IS mobilization.14 However, the total number of mutations in the whole genome of each strain remained unknown and sequence variations in these major strains can be expected to raise problems in the evaluation of phenotypes of mutants constructed from these strains. The history of these major substrains and additional substrains, isolated as a single colony from the PCC- and GT strain (GT-I strain; the standard strain in Dr Ikeuchi's group) was summarized by Ikeuchi and Tabata.8 The single colonies isolated from the PCC strain were designated PCC-P (positive phototaxis) strain and PCC-N (negative phototaxis) strain based on the direction of phototactic movement.15 A derivative of the GT-I strain that acquired high light tolerance and a glucose-sensitive phenotype was designated the WL strain, which has an SNP in the pmgA gene.16,17 Thus, there are two fundamental problems for post-genomic research in bacterial molecular genetics. One is the heterogeneity of cells in the frozen stock of the culture collection centres; the other is the frequent spontaneous mutation in bacterial genomes, an event that may be unavoidable during the long cultivation of bacterial cells. As revealed in Bacillus subtilis, whole-genome resequencing is a powerful solution for obtaining the sequence information of such spontaneous mutants.18
Without question, laboratories should start their post-genomic research with genome sequence data of the ‘reference’ or ‘standard’ strain. We deciphered the three substrains of Synechocystis sp. PCC 6803, i.e. PCC-P, PCC-N, and GT-I, to reconstruct the informatics basis of the molecular biology of Synechocystis sp. PCC 6803. We identified a number of SNPs and indels in these substrains and introduced a genetic strategy to identify the mutated loci on a genome-wide level using the massive parallel sequencer. Especially, determination of the genome sequence of PCC substrains will widely contribute cyanobacterial researches using the frozen stock cells supplied from the PCC.
Synechocystis PCC-P, PCC-N, and GT-I strains were maintained as frozen stocks in the laboratory of Dr Masahiko Ikeuchi at The University of Tokyo, Japan. The PCC-P and PCC-N strains that exhibited positive- or negative-direction movement under phototaxis test conditions, respectively, were isolated by Yoshihara et al.19 as a single colony from frozen stock obtained from the French Pasteur Culture Collection (PCC strain; see catalogue of strains.9). Genomic DNA was extracted with the hot-phenol method.16
DNA was uniformly sheared into 300-bp portions using Adaptive Focused Acoustics (Covaris Inc., Woburn, MA, USA). We constructed a DNA library with a median insert size of 300 bp for a paired-end read format. The quality of the DNA library was checked with the Sanger method by Escherichia coli transformation of aliquots of the library solution. The library was sequenced on a Genome Analyser II (Illumina Inc., San Diego, CA, USA). Sample preparation, cluster generation, and 50-base paired-end sequencing were according to the manufacturer's protocols with minor modifications (Illumina paired-end cluster generation kit GAII ver. 2, 36-cycle sequencing kit ver. 3) with multiplex method using the single lane of the 8 lane flow-cell. Image analysis and ELAND alignment were with Illumina's Pipeline Analysis software ver. 1.6. Sequences passing standard Illumina GA pipeline filters were retained.
For short-read alignment and calling variants (SNPs and InDels), we used the short-read mapping software MAQ version 0.7.1,20 BWA ver. 0.5.1,21 and SAMtools ver. 0.1.9.22 MAQ alignments were done using the ‘easyrun’ option of the maq-pl script using the default parameter settings. The SNP filtering was performed using the default parameters except for the minimum consensus quality for SNPs (-q 40). BWA alignments and the subsequent variants calling using SAMtools were done using the default parameter settings. Finally, we applied the following filtering criteria to the lists of SNPs/indels: minimum read depth for SNPs calling = 3, minimum read depth for indel calling = 10, and a 60% cut-off of the percent of aligned reads calling the SNP/indel per total mapped reads at the non-reference allele sites. We also used BWA to estimate the sequence read depth affecting the coverage and accuracy of the variant calls. Structural variations were identified using BreakDancer23 with default parameters.
Read sequences were assembled de novo with the Velvet assembly programme.24 For optimization of the hash value of the assembly process we used the N50 size. The de novo assembled contigs were mapped on the genome sequence of a database derived from the Kazusa strain using MUMmer sequence alignment package25 with default settings. We then employed show-SNPs functions of the MUMmer program to produce lists of SNPs/indels which were applied the filtering criteria describe above.
The list of SNPs/indels was then annotated with in-house developed software Variant Annotator (VA) that was specifically designed to check amino acid substitutions attributable to the large number of identified SNPs/indels using Genbank annotation files. We used the GenBank, RefSeq, cyanobacterial database Cyanobase (http://genome.kazusa.or.jp/cyanobase), CyanoClust,26 and also ORF information of the GT-S strain,27 which is the recently resequenced substrain of Synechocystis sp. PCC 6803, for precise annotation of each ORF.
About 200-base genomic regions around the SNPs and indels called by the mapping programmes were amplified by PCR and sequenced on a capillary sequencer with the Sanger method using the commercial sequence service of MACROGEN (Tokyo, Japan). To confirm the SNPs located near IS elements or repetitive regions, the longer DNA fragments were amplified to avoid the amplification of other homologous regions in the Synechocystis genome. The primers used for confirmation are listed in Supplementary Table S1.
Short-read data, obtained on a Genome Analyzer II (Illumina Inc., San Diego, CA, USA), of the substrains of Synechocystis sp. PCC 6803, PCC-P, PCC-N, and GT-I, were deposited in the DRA (DDBJ Sequence Read Archive; http://trace.ddbj.nig.ac.jp/DRASearch/); the accession number is DRA000401. The genome sequences and gene annotations of the substrain GT-I, PCC-P, and PCC-N were also deposited in the DDBJ/GenBank/EMBL database with the accession numbers for each substrain, GT-I (AP012276), PCC-P (AP012278), and PCC-N (AP012277).
Phylogenetic relationship of various strains was estimated by the maximum parsimony method by assuming that both base change and indel are treated as a single event. The computation was performed by the dolpenny software of the Phylip package version 3.67,28 using the polymorphism option. Each branch length was set as the number of events occurring along the branch.
The amplified DNA library was sequenced by GAII with 50-base paired-end methods using the parameter settings described in Materials and methods section. We obtained 250, 257, and 221 Mb read data for GT-I, PCC-N, and PCC-P substrains, respectively (Table 1). These read depths correspond to more than 60 times the genome size of Synechocystis. Read data were mapped using three analyses to identify the genomic position of the SNPs, indels, and rearrangements (Fig. 1): (i) BWA21 and MAQ20 for mapping analysis using raw read data; (ii) Velvet24 and MUMmer25 for mapping analysis using de novo assembled contigs; and (iii) BreakDancer23 for rearrangement analysis such as IS movement. The number of mutations was called by each programme and passed through the filter settings of the SAMtools program22 (Table 1). Distributions of averaged read depth in each 1 kb along with the entire genome, obtained by BWA and MAQ were shown in Supplemental Fig. S1. At least 15 times read depth was obtained even in the lowest read depth region. It also indicates that patterns of the read depth distribution depend on the algorithm of each mapping programme. The mapping programmes BWA, MAQ, and MUMer called 76, 69, and 85 potential mutation points, respectively, for the GT-I strain as primary data, 89, 75, and 109 points for the PCC-P strain, and 78, 79, and 104 points for the PCC-N strain. To confirm these results, we checked the sequence around all these loci by Sanger sequencing; regions of ~200 bp were amplified around the position of the mutations. Depending on the parameter settings, read depth and sequence specificity, it happens that common SNPs in all three substrains were detected only in one or two substrains in each programme. Even if the SNP is called only in one strain, we performed Sanger sequencing of the same locus in all three strains. Whole oligo DNA primers prepared for PCR reactions are listed in Supplemental Table S1.
Confirmation by Sanger sequencing alerted to a number of false-positive SNP- and indel-calls. Final numbers of SNPs/indels in each substrain were only about 20–40% of the called numbers by each programme as shown in Table 1. Most of the false-positive SNPs or indels were located in repetitive regions or in highly homologous genes in the Synechocystis genome. We expected that the cut-off value of mapping programmes as 60% is enough to detect a number of heterogeneous SNPs of Synechocystis which has multi-copy genome, because 80% SNPs were called as heterogeneous SNPs by BWA in case of GT-I strain. However, we could not detect any heterogeneous SNPs among them by the Sanger method. It indicates that there are technical problems in expecting the total number of the heterogeneous SNPs in the whole genome using cut-off value settings or base-call percentage data obtained by these mapping programmes. On the other hand, all 16 SNPs called homogeneous SNPs by BWA in GT-I strain were also confirmed by the Sanger method. The total number of mutations confirmed by the Sanger method is shown in Fig. 2 with the number of mutations including false-positive data in parenthesis. These numbers are the sum of results obtained for the three substrains. We found that the combinatorial use of several programmes is necessary for the comprehensive detection of SNPs/indels and for the identification of mutations. Mutation loci detected commonly by all three programmes were more reliable than that detected by only the single programme (Fig. 2). Difference of the distribution pattern of the read depth in each mapping programme also suggests that combination use of several programmes is more adequate (Supplemental Fig. S1). However, it is worth noting, SNPs/indels identified commonly by three programmes covered only 60% of the total number of mutations. The correct detection of IS movements reported by Okamoto et al.14 was possible only with BreakDancer, a programme developed for the detection of genome rearrangements.23
The confirmed mutations are listed in Tables 2 and and3.3. The three substrains, GT-I, PCC-P, and PCC-N, manifested at least 22 common different sites compared with the sequence of the Kazusa strain in the database. Among these mutation sites, 15 sites were different from the database sequence, but not real differences in the genomic loci as revealed by Tajima et al.27 (Table 2). We also found that there are totally 14 mutations between the GT/Kazusa- and the PCC-P/PCC-N strains (Table 3). The PCC-P and PCC-N substrains contained three and eight additional specific mutations, respectively (Table 3). These may be potential mutations that elicited the known difference in the phototactic phenotype of the PCC substrains. For example, the PCC-N substrain has mutation in the gspE2 (pilB2) gene for pilus assembly, which also moderately affects the transformation efficiency.15 The PCC-N strain also has a 12-base deletion in the kinase domain of the hik33 gene for the histidine kinase without a frameshift. Hik33 is the multi-stress sensor in Synechocystis and it is conserved in all cyanobacterial species.29–31 This suggests that this substrain may lose the Hik33-dependent regulation of global gene expression, although the relationship between hik33 and phototaxis remains to be determined.
A part of the indel regions was miss-called as SNPs by the mapping programmes. We aligned the genome sequence of the substrains around the indels (Fig. 3) and found that these indels were located in the middle of two direct repeat- or direct repeat-like sequences. Interestingly, these direct repeats found in the deleted regions were not common sequence. Mapping analyses using de novo assembled contigs (Velvet and MUMmer) help to correct the false-positive SNP/indel-calls made by mapping programmes such as BWA and MAQ (Table 2).
Some of the mutations confirmed by the Sanger method were putative heterogeneous SNPs. At these SNP loci, we detected a second peak from the other nucleotide. As the Synechocystis genome is multi-copy, some of the genome copies may harbour heterogeneous SNPs. These loci were not in a multi-repeat region of the genome.
A phylogenetic scheme of the history of the Synechocystis sp. PCC 6803 substrains is presented in Fig. 4. The predicted phylogenetic relationship of various strains indicated that the putative root may exist between the PCC branch and the GT branch. It suggests that all existing substrains do not have the original sequence of this organism isolated in 1968.
Frozen stocks in culture centres are basically stored as heterogeneous cell groups of the particular bacteria, such as E. coli K-12 MG1655.32 Thus, when aliquots of the frozen cells are thawed, selection bias due to the growth conditions arises; this affects the major genotype of the cells in culture. We posit that the heterogeneity of cells in different laboratories affected the results of genetic research or phenotypic analyses and we suggest that post-genomic research on cyanobacteria in the next generation should be based on resequenced strains as the new reference for each laboratory. In studies on post-genomic science using various mutants, it is better to prepare the frozen stock cells of single colonies one by one and also of the parental cell as soon as possible after the isolation of the mutants. It is also unquestionable that long-term cultivation of the cells on the gels or in the liquid cultures is not appropriate to keep the genotype of the cells.
We found a small number of substrain-specific mutations in PCC-P and PCC-N strains. It can be expected that a mutation accounts for the difference in the phototactic phenotype of PCC-P and PCC-N strains.19 Furthermore, the 14 mutations found in the GT/Kazusa- and the PCC-P/PCC-N strains suggest that their loci may affect the cell motility of these strains8 or their glucose tolerance.3 Our findings may be useful in functional studies on the gene that is disrupted by SNPs or indels in specific substrains. As Kamei et al.12 or Okamoto et al.14 suggested, the real function of several genes could only be studied in specific substrains which have the original nucleotide sequence without any SNPs, indels, or IS insertions.
Confirmation of our results with the Sanger method revealed that specific deletion patterns were miss-called or undetected at high frequency by mapping analysis of short-read type data. Several deleted regions in the middle of direct-repeat sequences were miss-called as SNPs or undetected (Tables 2 and and3).3). Alignment of the DNA sequence of the three substrains clearly showed direct repeats on both sides of the deletion locus (Fig. 3). After the deletion event only a single direct repeat sequence remained, leading to the hypothesis that the deletion was due to self-crossover. At present we do not know what mechanisms or factors trigger these deletions, but caution should be exercised when we perform long-term cultivation of cyanobacterial cells. This is also a technical problem of mapping analysis to detect the exact positions of SNPs and indels using massive short-read data of next-generation sequencers.
In this analysis, we found that the threshold value for SNP detection by BWA and MAQ was more than 60% of read data covering the SNP positions. However, Synechocystis sp. strain PCC6803 has a multi-copy genome,33 suggesting that the genome contains unidentified heterozygous SNPs below the threshold value. It is technically difficult to identify all heterozygous SNPs in the genome; if we lower the threshold, the number of false-positive SNPs increases drastically. Mutations hidden as minor heterozygous SNPs may become major SNPs under specific conditions and affect the cell phenotype, an observation reported by Hihara and Ikeuchi16 and Hihara et al.,17 who studied the pmgA mutant. The active retention of heterogeneity may be a strategy of cyanobacteria for acclimation to environmental changes. Heterogeneous SNPs found in the same locus in several substrains such as the loci 1192983 and 3098707 were the candidates to understand such mechanisms. The detection of minor heterozygous SNPs is a future problem for mapping using massive parallel sequencing.
In this study, we identified a number of differences in the genome sequence of laboratory strains and published sequence data derived from the Kazusa strain. Resequence data on PCC substrains will be useful for considering evolutionary events among Synechocystis intra-species. For the reconstruction of informatics in post-genomic studies on Synechocystis sp. PCC 6803, resequence analysis is effective and represents a powerful genetic strategy to identify potential mutation loci in spontaneous mutants with altered phenotypes.
This study was supported by Grants-in-Aid for Scientific Research from the Ministry of Education, Culture, Sports, Science and Technology (S0801025).