Screening 16S rRNA gene diversity with technologies such as Illumina, which produce millions of sequence reads, is a very appealing approach for addressing ecological questions in complex environments such as soils. However, as the present study indicates, several issues related to the capabilities of current technology and to the properties of the screened environments should be considered.
Sequence conservation is an important factor in determining how deeply various taxa can be screened using an existing library. Our results differed from previous studies that assessed conserved 16S rRNA gene areas on the basis of representative sequences from the total RDP database. Although the overall mapping of these areas on the selected reference 16S rRNA gene was consistent for most of the conserved nucleic acid bases screened, we identified a larger number of polymorphic sites than previously reported. A potential explanation for this observation is that the sequences deposited in the RDP database are dominated by bacteria related to the human microbiome. A simple keyword search (e.g. “human” or “soil”) shows that about 56% of the ~1,000,000 deposited 16S rRNA gene sequences longer than 1200 bp in the RDP database derive from environments related to the human body, whereas less than 5% derive from soil. This contrasts with the estimated numbers of species in the two environments: about 15,000 different species were identified in the complete human microbiome project, while more than 50,000 species were estimated to exist per gram of soil. Therefore, it is important to consider the particulars of the studied environment during experimental design, since they are connected to the diversity of existing niches.
Connected to the previous point is the operational fragment length for an Illumina application. With the latest available chemistries (v4–v5), the screening capacity of current Illumina technology is maximized when using the Genome Analyzer IIx (GAIIx) and exploiting paired-end reading (obtaining reads from both ends of each sequenced fragment). It has been demonstrated that relatively good read quality can be obtained for read lengths of 125 nucleotides for each of the two reads per fragment (with the second read showing lower quality at the error-prone read ends). Assembly of the paired-end reads per sequenced amplicon in previously published studies required a minimum read overlap of 5–12 nucleotides, which reduces the operational amplicon length to a maximum of 226 bp. Moreover, our screening of RDP sequences for potential tandem repeats that might interfere with assembly at the overlapping regions did not indicate that such problems would arise with a 10-nucleotide overlap (data not shown). The 226 bp amplicon therefore appears to be an upper limit as far as the influence of length on screening ability is concerned. However, multiplexing is the major objective of technological applications, and it requires the addition of barcode sequences to at least one of the two primers used. Proposed multiplexing methods involve: a) primer indexing by adding a few unique bases to the 5′ end of one (or both) of the amplification primers, plus a 2 bp linker sequence that reduces potential effects of the index stretch on reaction specificity during PCR of environmental samples; b) use of primers with 5′ extensions carrying Illumina sequencing adapters plus an index sequence, which enables a third sequence read (when paired-end reads are used) for barcode identification and does not affect the operative read length (a similar philosophy to that of Illumina multiplexing kits). All these methods have advantages and disadvantages, but all of them restrict amplicon screening to a maximum of approximately 215 bp. This screening length was indicated as being sufficient for screening all V regions with less than 0.5% information loss. However, 16S rRNA gene conservation around V4 indicated that robust primer design for such short amplicons is difficult to achieve for soil environmental samples.
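The role of the minimum read overlap in paired-end assembly can be illustrated with a minimal Python sketch. It is not the assembly procedure of any of the cited studies: the function names, the toy amplicon and the exact-match rule are assumptions made for illustration, and real assemblers typically tolerate mismatches and weigh base qualities.

```python
# Toy paired-end merge: join two reads when they overlap exactly by at least
# `min_overlap` bases. All names and sequences here are hypothetical.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(read1, read2, min_overlap=10):
    """Merge a forward read with a reverse read sequenced from the opposite end.

    The reverse read is reverse-complemented, then the longest exact
    suffix/prefix overlap of at least `min_overlap` bases is used to join
    the two reads. Returns the merged sequence, or None if no such overlap
    exists.
    """
    rc2 = reverse_complement(read2)
    # Try the longest possible overlap first, down to the minimum allowed.
    for k in range(min(len(read1), len(rc2)), min_overlap - 1, -1):
        if read1[-k:] == rc2[:k]:
            return read1 + rc2[k:]
    return None


# Hypothetical 40 bp amplicon read with two 25-nucleotide reads (10 bp overlap).
toy_amplicon = "ACGTTGCAAGGCTTACCGATAGCTTGACCATGGTCAATCG"
forward_read = toy_amplicon[:25]
reverse_read = reverse_complement(toy_amplicon[15:])
merged = merge_pair(forward_read, reverse_read, min_overlap=10)
```

In this toy case the reads share 10 bases, so the merge succeeds with `min_overlap=10` and fails with a stricter threshold such as 12. This is the trade-off between required overlap and usable amplicon length discussed above.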
Soil-derived sequences from the RDP database were further analyzed to assess how well the sequence parts belonging to the tested V regions represent their full-length (FL) sequences, in terms of the distances and taxonomy annotations obtained during sequence comparisons. Correlation tests between the distances generated for the same strains with the FL sequences and with their V region variants showed an overall superior performance for the V4 region dataset, followed by V5, both for the Pearson correlation values and for the dispersal of points around the fitted linear model. However, closer examination of the V region datasets showed that, for FL dataset distances of 0–13%, distances appear to be overestimated for V3 and underestimated for V5 and V6. This indicates that more per-base variability has accumulated in the V3 region than in the other V regions and the corresponding section of the FL sequence. Higher resolution of signature sequences can therefore be obtained at the corresponding OTU definitions.
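The comparison of full-length and V region distances can be sketched as follows. This is an illustrative Python example, not the pipeline used in this study: the uncorrected p-distance, the toy alignment and the region coordinates are all assumptions. A fitted slope above 1 would correspond to the region overestimating FL distances, and a slope below 1 to underestimation.

```python
from itertools import combinations
from math import sqrt


def p_distance(a, b):
    """Fraction of aligned positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def pairwise_distances(seqs, start=0, end=None):
    """p-distances for all sequence pairs, restricted to columns [start, end)."""
    return [p_distance(a[start:end], b[start:end])
            for a, b in combinations(seqs, 2)]


def pearson(xs, ys):
    """Pearson correlation coefficient (undefined if either input is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sum((x - mx) ** 2 for x in xs)


# Hypothetical aligned "full-length" sequences; columns 5-14 stand in for a V region.
toy_alignment = [
    "ACGTACGTACGTACGTACGT",
    "ACGTACGTTCGTACGTACGT",
    "ACGAACGTTCGTACCTACGT",
    "TCGAACGTTCGAACCTACGA",
]
fl_distances = pairwise_distances(toy_alignment)
region_distances = pairwise_distances(toy_alignment, 5, 15)
r = pearson(fl_distances, region_distances)
b = slope(fl_distances, region_distances)
```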
Taxonomy classification of the V region and FL datasets indicated that some information is lost as sequence size is reduced, particularly for the V6 dataset. However, for the V3, V4 and V5 datasets, classified sequences amounted to 70% or more of the total reads and to more than 90% of the FL classified sequences, even at taxonomical level 5 (encompassing order, suborder and family level classifications). Thus, use of these regions may facilitate relatively thorough screening of taxa related to a large part of the global biogeochemistry of natural environments. According to the phylum level analysis, the taxonomical information lost in the V region datasets (which increased the group of unclassified sequences) derived mainly from intermediately populated or rare phyla of the reference database. In this analysis, the V6 dataset had more than twice as many unclassified sequences as the FL dataset, while the other V region datasets had approximately 1.5 times as many. The fact that less populated phyla were also under-represented during classification is partly due to the composition of the reference database. Low representation of taxa in the reference database affects the classification confidence and the probability of identification via partial sequence read (word) matches while searching for the closest sequences with the naïve Bayesian classifier.
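The dependence of word-based classification on reference representation can be made concrete with a small Python sketch. It loosely follows the general idea of a word-based naïve Bayesian classifier, but it is not the RDP Classifier implementation: the word size of 8, the smoothing formula, the function names and the toy reference sequences are all assumptions of this example.

```python
from math import log

WORD_SIZE = 8  # assumed word length, used here for illustration only


def words(seq, k=WORD_SIZE):
    """Set of distinct overlapping k-base words in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def train(references):
    """references: dict mapping taxon name -> list of reference sequences."""
    taxon_words = {t: [words(s) for s in seqs] for t, seqs in references.items()}
    total = sum(len(wsets) for wsets in taxon_words.values())
    return taxon_words, total


def word_prior(word, taxon_words, total):
    """Smoothed fraction of all reference sequences that contain the word."""
    n = sum(word in ws for wsets in taxon_words.values() for ws in wsets)
    return (n + 0.5) / (total + 1)


def log_score(query, taxon, taxon_words, total):
    """Sum of log conditional word probabilities of the query under one taxon."""
    wsets = taxon_words[taxon]
    score = 0.0
    for w in words(query):
        m = sum(w in ws for ws in wsets)  # taxon sequences containing the word
        score += log((m + word_prior(w, taxon_words, total)) / (len(wsets) + 1))
    return score


def classify(query, references):
    """Return the taxon with the highest log score for the query."""
    taxon_words, total = train(references)
    return max(taxon_words, key=lambda t: log_score(query, t, taxon_words, total))


# Hypothetical reference set: taxon "A" has two sequences, taxon "B" only one.
toy_references = {
    "A": ["ACGTTGCAAGGCTTACCGATAGCTTGACCATGGTCAATCG",
          "ACGTTGCAAGGCTTACCGATAGCTAGACCATGGTCAATCG"],
    "B": ["TTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAA"],
}
partial_read = toy_references["A"][0][5:30]  # a 25-base fragment of an "A" sequence
best_taxon = classify(partial_read, toy_references)
```

A sequence of length L contributes at most L − 7 words of size 8, so a shorter read offers fewer word matches to separate taxa. A taxon with few reference sequences also offers fewer distinct words to match, which is consistent with the under-representation of sparsely populated phyla described above.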
The simulated analysis provided an approximation of the effect that the relative abundance and richness of sequences in environmental soil samples would have on diversity assessment. Overall, the datasets of V regions encompassing longer sequence stretches (V3 and V4) generated sample distances more similar to those produced by the FL dataset than V5 and V6 did. Such differences between the V3, V4 and V5 datasets were not indicated in the database screening analyses performed in the first part of this study. This is possibly due to the composition of the tested soil microbiomes, which have an increased relative abundance of sequences that show performance differences when the trimmed V5 or V6 regions are compared with their full-length sequence variants.
Combining Illumina sequencing technology with the screening of partial 16S rRNA gene sequence reads in environmental samples can be a powerful tool for microbial ecology studies. However, this combination has some limitations that result from the screened sequence length. Selecting the V3 region as the screened 16S rRNA gene stretch did not perform as well when the non-redundant soil-derived sequence dataset was screened, but it performed better when sequence frequencies came close to those found in soil environments. V4 had a high overall performance, but compared to the other regions it showed reduced conservation of the sequence sites flanking the V region. This lack of conservation may restrict diversity screening depth. V5 had a desirable diversity screening depth and an overall good performance for the non-redundant dataset, but the information extracted from this region showed differences with the full-length 16S rRNA gene sequence variants in the non-redundant dataset, thus showing the effect of the composition of the tested bacterial communities on the outcome of the V5 selection approach. V6 was outperformed in all tests apart from that of flanking sequence conservation.
Collectively, these results suggest that partial 16S rRNA gene sequence reads corresponding to single V regions have flaws compared to their FL variants in soil bacterial community studies. Nevertheless, some appear to capture the FL sequence information to a great degree. The properties of V3 can match the demands of many total soil bacterial community screening studies. V5, on the other hand, is a relatively well performing representative of the shorter V regions. The shorter V regions also provide an opportunity to assess the sequencing quality of the reads used, since longer parts of the reads from the two strands of the sequenced amplicon overlap during assembly, which is performed as part of the reconstruction of the screened V region sequence, and the agreement of base calls in the overlapping parts can therefore be examined.
Incorporating database exploration into the initial stages of experimental setup is strongly suggested as a way of improving the strategy towards the experimental goals. This holds especially true for the primer design phase, which is crucial for the quality of the data produced. Careful selection of template sequences for the primer design process can improve primer-set collections for highly diverse environments like soil. Potential for further methodological improvement can be found in approaches such as screening more than a single V region or even using multiple housekeeping genes. However, it must be acknowledged that part of the power of combining bacterial 16S rRNA gene screening with Illumina sequencing relies on the extensive existing databases of full or nearly full-length gene sequences, something that is lacking to some degree for other genes.