How to calculate 16S rRNA sequence similarity values for bacterial taxonomy: Why BLAST should be avoided

Nucleotide sequence similarity values are widely used by bacterial taxonomists for the identification and description of novel species. Many different algorithms are available for calculating the similarity between two gene sequences, and the results are often easy to misinterpret. The method described below is recommended for obtaining nucleotide sequence similarity values for taxonomic purposes.

The calculation of sequence similarity between two genes consists of two steps:

(i) pairwise sequence alignment and
(ii) calculation of the similarity value.

Pairwise sequence alignment can be achieved using either a global or a local alignment algorithm. A global alignment algorithm is recommended, and local alignment algorithms should be avoided (please see here for details). This is why a BLAST-series program, which performs local alignment, should NOT be used for calculating similarities. Even so, BLAST is still the best tool for identifying the most similar sequences within a large database of sequences.
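As an illustration of step (i), here is a minimal sketch of global pairwise alignment. It uses the textbook Needleman-Wunsch algorithm with a linear gap penalty, which is an assumption made for brevity and is not the algorithm used by any particular server. The function name global_align and the scoring values (match=1, mismatch=-1, gap=-2) are hypothetical choices for illustration only.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align sequences a and b (Needleman-Wunsch, linear gap penalty).

    Illustrative sketch only: the scoring values are arbitrary assumptions.
    Returns the two aligned strings, with '-' marking gaps.
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j] end to end.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner, so that both sequences
    # are aligned over their full length (this is what makes it global).
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


aligned_a, aligned_b = global_align("ACGT", "AGT")
print(aligned_a)
print(aligned_b)
```

Because the traceback always starts at the end of both sequences and runs to their beginning, every position of both sequences appears in the alignment. A local alignment, by contrast, may report only the best-matching region, so a similarity value derived from it can ignore mismatches outside that region.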

In the EzBioCloud server, the closest neighboring taxa are first identified using the BLASTN program, and then a rigorous pairwise sequence alignment algorithm (Myers & Miller, 1988) is used to calculate sequence similarity. When sequence similarity is calculated, gaps are not considered. Using pairwise sequence alignment instead of multiple sequence alignment ensures the reproducibility of the similarity calculation. For example, if you obtain the sequence similarity between A and B from a pairwise sequence alignment, the value will always be the same. However, the values between A and B calculated from a multiple sequence alignment of A, B, and C and from a multiple sequence alignment of A, B, and D may differ, because the multiple sequence alignment algorithm tries to find the optimal solution among all sequences, not just between A and B.
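The similarity calculation in step (ii) can be sketched as follows. The sketch assumes that "gaps are not considered" means that alignment columns containing a gap in either sequence are excluded from both the match count and the number of compared positions; this reading, the function name similarity_percent, and the use of '-' as the gap character are assumptions for illustration and are not taken from the EzBioCloud implementation.

```python
def similarity_percent(aligned_a, aligned_b, gap_char="-"):
    """Percent similarity of two already-aligned sequences, skipping gap columns.

    Assumption: a column with a gap in either sequence is ignored entirely,
    so it counts neither as a match nor as a compared position.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    compared = 0
    matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap_char or y == gap_char:
            continue  # gap column: not considered
        compared += 1
        if x == y:
            matches += 1

    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return 100.0 * matches / compared


# One mismatch in four gap-free columns.
print(similarity_percent("ACGT", "ACGA"))
# The gap column is skipped; the three remaining columns all match.
print(similarity_percent("ACGT", "A-GT"))
```

Since the input is a fixed pairwise alignment of A and B, the function returns the same value every time for that pair, regardless of which other sequences are in the dataset.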

These recommendations are also described in the following publications:

1. Kim, M., Oh, H.S., Park, S.C. & Chun, J. Towards a taxonomic coherence between average nucleotide identity and 16S rRNA gene sequence similarity for species demarcation of prokaryotes. Int J Syst Evol Microbiol 64, 346-51 (2014).
2. Kim, O.S. et al. Introducing EzTaxon-e: a prokaryotic 16S rRNA gene sequence database with phylotypes that represent uncultured species. Int J Syst Evol Microbiol 62, 716-21 (2012).
3. Tindall, B.J., Rossello-Mora, R., Busse, H.J., Ludwig, W. & Kampfer, P. Notes on the characterization of prokaryote strains for taxonomic purposes. Int J Syst Evol Microbiol 60, 249-66 (2010).

By Jon Jongsik Chun (CEO of ChunLab, Inc. & Professor at Seoul National Univ.)

Updated on April 4th 2016.