DNA sequence information from the 1.5 kb small subunit 16S ribosomal RNA (rRNA) gene has been used to successfully identify and phylogenetically classify microorganisms from environmental and medical samples (1–3). In more ambitious efforts, the relative abundance of bacterial groups has been estimated by sequencing hundreds to thousands of 16S rRNA genes derived from a sample (4–9). However, a bottleneck in data analysis is encountered in creating multiple sequence alignments (MSA). The MSA is a common means of communicating a proposed positional homology among many genes using a column-by-column format. It can be stored and presented in a variety of formats but in all cases it represents a two-dimensional matrix with each row describing a gene and each column holding the nucleotide found at a certain position along the gene. Alignments are useful when gaps have been appropriately added to mark an inference of an insertion or deletion event where one sequence has a base while another sequence lacks a base at the corresponding position. This process yields sequence strings occupying an equal number of columns allowing the matrix to form a true rectangle. MSAs are desirable for annotation of conserved versus variable gene loci by observing heterogeneity along the columns, recruiting columns with sufficient data for inter-row (sequence) comparison, and calculating distance matrices for row clustering. ClustalW (10) is a commonly used progressive MSA method for inserting gaps into sequences to achieve perfect rectangles. Hundreds of diverse sequences can be aligned using this approach to establish a ‘profile’ alignment. Later, new sequences can be added to this profile without re-computing the optimal gap placements for the entire alignment.
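The matrix view of an MSA described above can be made concrete with a small sketch. The toy sequences, the function names and the simple mismatch-fraction distance are all assumptions chosen for illustration; they are not taken from ClustalW or from any published distance correction.

```python
from itertools import combinations

GAP = "-"  # gap character marking an inferred insertion or deletion

def is_rectangular(msa):
    """Return True when every row occupies the same number of columns."""
    return len({len(row) for row in msa.values()}) <= 1

def column_heterogeneity(msa):
    """Count the distinct non-gap characters found in each column."""
    rows = list(msa.values())
    return [len({c for c in col if c != GAP}) for col in zip(*rows)]

def pairwise_distance(a, b):
    """Fraction of mismatches over columns where both rows hold a base."""
    pairs = [(x, y) for x, y in zip(a, b) if x != GAP and y != GAP]
    if not pairs:
        return float("nan")  # no shared columns to compare
    return sum(x != y for x, y in pairs) / len(pairs)

# Hypothetical toy alignment: three rows, nine columns.
msa = {
    "seqA": "ACGT-ACGT",
    "seqB": "ACGTTACGA",
    "seqC": "ACG--ACGT",
}

heterogeneity = column_heterogeneity(msa)  # only the last column varies
distances = {
    (a, b): pairwise_distance(msa[a], msa[b])
    for a, b in combinations(msa, 2)
}
```

Here only the final column holds more than one base, and the distance between seqA and seqB is 0.125 (one mismatch over eight shared columns). Columns where either row holds a gap are skipped, which is one simple way of recruiting only columns with sufficient data for a given comparison.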
Frequently, when adding a candidate sequence to an MSA profile, one or more internal insertions will be discovered which cannot be accommodated in the profile. This event requires a researcher to make one of two choices: (i) allow the column count to grow whenever an insertion is required, which requires each sequence to gain more characters, or (ii) allow a local misalignment within a sequence (row) so that the insertion does not disrupt the entire alignment format of the profile. Until now, the only readily available choice has been the former, as implemented by ClustalW. Certain objectives are left unsatisfied by this approach. In some instances, the apparent need to create new columns in the MSA is due to the presence of poor quality sequences. If such insertions were allowed, the MSA could expand into a cumbersome collection of unsubstantiated insertion inferences (gaps). Ongoing comparative sequence analysis projects benefit from having a fixed column count in the MSA, enabling unchanging annotation of position-dependent features such as primer annealing locations, secondary structures and column masks. Furthermore, collaborative MSA construction becomes problematic when copies of a single profile diverge in column content as individual researchers add their own unique data. To enable fixed column counts, allow piecemeal MSA curation and support collaborations in comparative genomics, the local misalignment approach is now available and implemented via NAST (Nearest Alignment Space Termination).
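The two choices can be contrasted in a short sketch. This is a simplified illustration and not the published NAST procedure: the rule of deleting the gap nearest to the insertion, the function names `expand_profile` and `absorb_insertion`, and the toy sequences are all assumptions made for the example.

```python
GAP = "-"

def expand_profile(profile, column):
    """Choice (i): add a gap column to every profile row at `column`."""
    return {
        name: row[:column] + GAP + row[column:]
        for name, row in profile.items()
    }

def absorb_insertion(candidate, column, width):
    """Choice (ii): keep `width` columns by deleting the nearest gap(s).

    `candidate` is an aligned row that is longer than the profile because
    it carries extra bases starting at index `column`. Deleting a nearby
    gap shifts the bases between that gap and the insertion by one
    column, which is the local misalignment.
    """
    row = list(candidate)
    while len(row) > width:
        gaps = [i for i, c in enumerate(row) if c == GAP]
        if not gaps:
            raise ValueError("no gap available to absorb the insertion")
        nearest = min(gaps, key=lambda i: abs(i - column))
        del row[nearest]
        if nearest < column:
            column -= 1  # the insertion moved one column to the left
    return "".join(row)

# Hypothetical profile of eight columns.
profile = {
    "ref1": "AC-GTTAC",
    "ref2": "ACAGTTAC",
}
# Candidate aligned against ref1 with an extra 'C' at index 5 (nine columns).
candidate = "AC-GTCTAC"

grown = expand_profile(profile, 5)             # every row gains a column
fixed = absorb_insertion(candidate, 5, width=8)  # candidate stays at 8 columns
```

With choice (i) both reference rows become nine columns wide ("AC-GT-TAC" and "ACAGT-TAC"). With choice (ii) the candidate becomes "ACGTCTAC": the profile is untouched, at the cost of the candidate's G and T sitting one column to the left of their homologous positions.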
We have established a web service for creating NAST MSAs from user data, intended to facilitate comparison of 16S rRNA gene sequences from bacteria and archaea. This service has performed well in aligning thousands of user-supplied sequences into a single MSA while optionally intercalating genes from reference organisms. It was created to handle large datasets produced in exploratory microbial ecology, medical microbiology and metagenomics. One distinctive feature is that NAST can output the MSA in a standard, consistent format of 7682 characters per sequence so that similar loci are located at dependable positions from batch to batch (necessary for large, ongoing projects). Data can optionally be pre-processed based on chromatogram quality scores, and post-processing options include distance matrix creation and taxonomic classification using five independent curators' nomenclature.
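A fixed width is what allows position-dependent annotations to be reused from batch to batch. The sketch below shows one way a downstream script might verify the 7682-character format and apply a column mask; the helper names and the '0'/'1' mask encoding are hypothetical and are not part of the web service.

```python
NAST_WIDTH = 7682  # characters per aligned sequence, as stated in the text

def check_width(rows, width=NAST_WIDTH):
    """Return the names of rows whose length differs from `width`."""
    return [name for name, row in rows.items() if len(row) != width]

def apply_mask(row, mask):
    """Keep only the characters at columns where the mask holds '1'.

    The '0'/'1' string encoding of the mask is an assumption made for
    this example.
    """
    if len(row) != len(mask):
        raise ValueError("row and mask must have the same column count")
    return "".join(c for c, m in zip(row, mask) if m == "1")

# Hypothetical rows: one at the expected width, one a character short.
rows = {
    "ok": "-" * NAST_WIDTH,
    "short": "-" * (NAST_WIDTH - 1),
}
bad = check_width(rows)                  # names of rows to reject
masked = apply_mask("AC-GT", "11011")    # toy five-column example
```

Because every batch shares the same column count, a single mask or primer annotation can be applied to rows aligned at different times without re-indexing.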
We have received considerable positive feedback from diverse users who have collectively submitted over 1600 jobs.