Bacterial Taxonomy 101
Understanding the terminology, basic concepts and practices of bacterial taxonomy is a prerequisite for proper microbiome analysis. Here, we outline the most important concepts, terms and practices as they pertain to microbiome bioinformatics. We refer to “Bacteria” throughout, but these ideas extend to “Archaea” as well, as both use the same principles and are governed by a single formal process.
Three processes of bacterial taxonomy
Bacterial taxonomy consists of three basic processes:
- Classification: There are a great many bacterial strains living on earth. The goal of classification is to organize these strains into groups. The basic unit is the “species”, and our first task is to find the groups of strains that correspond to species. Once we have defined one or more species, we can define larger groups based on their phenotypic properties and phylogeny. For example, a genus consists of one or more species, a family comprises one or more genera, and so on. Ultimately, the goal is to build a hierarchical system of groups of naturally existing strains.
- Nomenclature: Once we have identified the groups (from species to phyla), we need to name them properly. There are rules for proposing and approving names of bacterial taxa. Since there is no government or parliament for the science of taxonomy, bacterial taxonomists established official bodies and processes to control bacterial names. This is important for avoiding unnecessary confusion and for efficiently supporting other scientific fields that use taxonomy (e.g., clinical diagnostics and ecology).
- Identification: Newly isolated strains can be assigned to known species using various methods; this is called identification. If new strains cannot be assigned to a known species because they belong to a hitherto unknown species, they cannot be identified. For such a new species, we go back to classification and then nomenclature. The three processes of bacterial taxonomy are therefore applied iteratively. In fact, >1,000 new species of Bacteria and Archaea are described each year.
The basic unit for taxonomic profiling
The basic unit for taxonomic profiling of a microbiome is the species. Because the results of metagenomic analysis contain data derived from uncultured (not unculturable) organisms and otherwise complex mixtures of organisms, we must define another term. The Operational Taxonomic Unit (OTU) is often used for this purpose. The terms species and OTU are often used interchangeably, but they should be kept distinct.
- Species: a basic taxonomic unit that is defined by the formal process (outlined later) according to classical Linnaean taxonomy.
- OTU: Operational Taxonomic Unit. A cluster of sequences or organisms defined purely by the sequence similarity of DNA barcode molecules. In bacterial microbiome studies, an OTU is commonly defined as a cluster of 16S sequences showing at least 97% sequence similarity. This type of OTU is also called a molecular OTU.
The goal of microbiome research is to compare the taxonomic profiles of different samples at the species level or below. Because of the limited resolution of the 16S barcode, subspecies-level information cannot be obtained. However, with carefully designed sequencing and bioinformatics, species-level profiling can be achieved. When a reference database is not well established, the OTU serves well as the basic unit of taxonomic profiling in place of the species.
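To make the OTU definition concrete, here is a minimal sketch of greedy clustering at a 97% cutoff. It assumes the sequences are already aligned to the same length; the function names (`percent_identity`, `greedy_otus`, `mutate`) and the toy sequences are hypothetical and for illustration only. Production OTU pickers use real alignments and more elaborate heuristics.

```python
from typing import Dict, List


def percent_identity(a: str, b: str) -> float:
    """Percent identity of two pre-aligned, equal-length sequences.

    Columns where both sequences carry a gap ('-') are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" and y == "-":
            continue
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def greedy_otus(seqs: Dict[str, str], cutoff: float = 97.0) -> List[List[str]]:
    """Put each sequence into the first OTU whose centroid it matches at >= cutoff."""
    otus: List[List[str]] = []  # each OTU is a list of IDs; the first ID is the centroid
    for seq_id, seq in seqs.items():
        for otu in otus:
            if percent_identity(seq, seqs[otu[0]]) >= cutoff:
                otu.append(seq_id)
                break
        else:
            otus.append([seq_id])  # no centroid was close enough: start a new OTU
    return otus


def mutate(seq: str, n: int) -> str:
    """Toy helper: substitute the first n bases so that each one mismatches."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    return "".join(swap[c] for c in seq[:n]) + seq[n:]


base = "ACGT" * 25  # 100-base toy "16S" sequence
toy = {"ref": base, "near": mutate(base, 2), "far": mutate(base, 10)}
print(greedy_otus(toy))  # [['ref', 'near'], ['far']]
```

In this toy input, "near" is 98% identical to "ref" and joins its OTU, while "far" is 90% identical and starts a new one. Note that the result of greedy clustering depends on the input order, because the first member of each OTU acts as its centroid.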
A single OTU may contain multiple species. Take Bacillus cereus, a well-known soil bacterial species. If the OTU concept with a 97% cutoff is applied, twenty-three species would collectively be considered a single OTU, because their 16S sequence similarities are >98% (see the table below; data generated from https://www.ezbiocloud.net/identify using the sequence of the B. cereus type strain).
Table. 16S sequence similarities of some Bacillus species against the type strain of Bacillus cereus. Please note that CP013274_s and JH792383_s are genomospecies.
|Hit rank|Species|Strain|Accession|Pairwise similarity (%)|
|---|---|---|---|---|
|1|Bacillus cereus|ATCC 14579|AE016877|100|
|8|Bacillus wiedmannii|FSL W8-0169|LOBC01000053|99.86431479|
|14|Bacillus thuringiensis|ATCC 10792|ACNF01000156|99.72862958|
|17|Bacillus pseudomycoides|DSM 12442|ACMX01000133|99.59294437|
|18|Bacillus weihenstephanensis|NBRC 101238|BAUY01000093|99.45725916|
|19|Bacillus mycoides|DSM 2048|ACMU01000002|99.45725916|
|21|Bacillus gaemokensis|KCTC 13318|LTAQ01000012|98.98167006|
|23|Bacillus cytotoxicus|NVH 391-98|CP000764|98.03122878|
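As a quick check of the table above, the following sketch applies the 97% OTU cutoff and the 98.7% species-recognition cutoff (discussed later in this article) to the listed similarity values. The variable names are hypothetical; the numbers are copied from the table.

```python
# (species, pairwise 16S similarity in % to the B. cereus type strain), from the table
hits = [
    ("Bacillus cereus", 100.0),
    ("Bacillus wiedmannii", 99.86431479),
    ("Bacillus thuringiensis", 99.72862958),
    ("Bacillus pseudomycoides", 99.59294437),
    ("Bacillus weihenstephanensis", 99.45725916),
    ("Bacillus mycoides", 99.45725916),
    ("Bacillus gaemokensis", 98.98167006),
    ("Bacillus cytotoxicus", 98.03122878),
]

OTU_CUTOFF = 97.0      # conventional OTU threshold
SPECIES_CUTOFF = 98.7  # 16S cutoff for recognizing species

# Species that a 97% OTU would lump together with B. cereus
same_otu = [name for name, sim in hits if sim >= OTU_CUTOFF]

# Species that 16S alone can already separate from B. cereus
separable_by_16s = [name for name, sim in hits if sim < SPECIES_CUTOFF]

print(len(same_otu), "of", len(hits), "listed species fall into one OTU")
print("Separable from B. cereus by 16S alone:", separable_by_16s)
```

All eight listed species land in the same 97% OTU, and only Bacillus cytotoxicus falls below the 98.7% cutoff. For the others, 16S similarity alone cannot decide species membership.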
How are bacterial species defined?
This blog explains the details of how bacterial species are defined [Learn more]. Here, we outline the essentials of the currently accepted bacterial taxonomy.
- A bacterial species should have a designated type strain (nomenclatural type) which is a live microorganism. It should be available to anyone who wants to study it. Usually, type strains can be obtained from public or private/for-profit culture collections.
- The modern species concept tries to bring genomics into practice. The taxonomically accepted measure of similarity between two genome sequences is Average Nucleotide Identity (ANI), which is calculated using a series of bioinformatic algorithms [Learn more]. See Chun et al. (2014, 2018) for how to apply ANI to bacterial taxonomy. The proposed ANI cutoff is 95-96%. If a bacterial strain shows an ANI at or above this cutoff to the type strain of species A, it is assigned to species A. If we had a reference genome database containing all species on earth, the ANI-based approach would serve as a perfect platform. However, we do not have these data in hand: not all type strains have been sequenced, and there are more uncultured species than cultured ones so far.
- The 16S sequence is still widely used in bacterial taxonomy, although it is used somewhat differently from genome sequences. A recent study showed that 98.7% can be used as a cutoff for recognizing species. Again, the comparison should be made against the type strain for taxonomic purposes.
  - If a strain shows a 16S similarity lower than 98.7% to the type strain of species A, it does not belong to species A.
  - Otherwise, the strain may or may not belong to species A. The higher the similarity, the greater the chance that it is a member of species A. However, in exceptional cases, even two strains with 100% identical 16S sequences can show <95% ANI, meaning that they belong to different species.
  - The 98.7% cutoff should be used when sequencing errors are minimal. Applying a 97% cutoff to define OTUs is therefore reasonable when single-pass NGS sequences are considered.
- In conclusion, the combination of 16S and ANI similarities can be used for the classification and identification of bacteria.
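The rules above can be summarized as a small decision function. This is only a sketch: the function name `compare_to_type_strain` is hypothetical, 95.0 is taken as the lower bound of the proposed 95-96% ANI range (an assumption), and both similarity values are assumed to have been computed against the type strain of the candidate species.

```python
from typing import Optional

ANI_CUTOFF = 95.0  # assumption: lower bound of the proposed 95-96% range
S16_CUTOFF = 98.7  # 16S cutoff for recognizing species


def compare_to_type_strain(sim_16s: Optional[float] = None,
                           ani: Optional[float] = None) -> str:
    """Decide species membership from similarity (in %) to a type strain.

    ANI, when available, is decisive. 16S alone can only exclude membership.
    """
    if ani is not None:
        return "same species" if ani >= ANI_CUTOFF else "different species"
    if sim_16s is None:
        raise ValueError("provide sim_16s, ani, or both")
    if sim_16s < S16_CUTOFF:
        return "different species"
    # High 16S similarity is compatible with membership but does not prove it
    return "undetermined"


print(compare_to_type_strain(sim_16s=98.0))             # different species
print(compare_to_type_strain(sim_16s=99.9))             # undetermined
print(compare_to_type_strain(sim_16s=100.0, ani=94.0))  # different species
print(compare_to_type_strain(sim_16s=99.5, ani=97.2))   # same species
```

The third call reflects the exception noted above: identical 16S sequences do not guarantee the same species when the ANI is below the cutoff.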
Bacterial species types used for microbiome study
Several types of species or similar terms can be defined and used for microbiome analysis.
- Species with a valid name: This is the standard type of species. The description of the species is published, and the type strain is deposited in one or more culture collections. At present, the only conditions for validating the name of a species are (i) publication in any journal and (ii) deposition of the type strain in two culture collections in two different countries. The scientific community regulates nomenclature (the process of naming), but not taxonomy itself. Therefore, it is the name that is validated, or said to be “valid”, not the species; the term “valid species” is incorrect. If you are interested in how bacterial names are regulated, consult the International Code of Nomenclature of Prokaryotes. An example is Escherichia coli.
- Species with an invalid name: A species with an invalid name is similar to a species with a valid name except that its name is not listed on the Approved List. This list of formally recognized names is published in the International Journal of Systematic and Evolutionary Microbiology (IJSEM). There are two types of list. The Notification List contains the approved names that were published in IJSEM; this is done automatically by an editor of IJSEM. If a paper describing a new species or any other taxonomic change was published in a journal other than IJSEM, the authors should submit a reprint to IJSEM so that the name of the new or changed taxon is listed in the Validation List. Here are example articles of the Notification and Validation Lists. The main reasons for a name being invalid are:
  - The type strain was not deposited in two different culture collections in two different countries, so the name does not meet the conditions for validation.
  - The (effective) publication has not yet been submitted to IJSEM for validation.
- Candidatus: Candidatus means a “candidate species”. The concept of Candidatus was first introduced by Murray & Stackebrandt (1995). It is not part of the formal nomenclature, so the name itself is not italicized (e.g. in Candidatus Carsonella ruddii, only the word Candidatus is set in italics, not Carsonella ruddii). Candidatus names are usually given to candidate species that cannot be cultivated as pure cultures. Typical cases are the prokaryotic obligate endosymbionts of animals and plants, such as Candidatus Carsonella ruddii.
- Phylotype: In many cases, we know that a species exists but lack the supporting data to validate its name. We call it a “phylotype”. The typical cases of phylotypes used in the EzBioCloud database are as follows.
  - Genomospecies: A genomospecies deserves species status and is supported by genomic data (e.g. ANI). However, it has never been named, so the EzBioCloud team gave it a unique name, usually derived from the accession number in the INSDC databases. For example, the phylotype CP013274_s is represented by a genome sequence deposited in INSDC as a strain of Bacillus thuringiensis that showed <95% ANI to all known Bacillus species. It is therefore assigned as a new phylotype (equivalent to a species) [Learn more]. A genomospecies in the EzBioCloud database is always represented by an accurate 16S sequence.
  - 16S phylotype: If a 16S sequence is accurate, full-length and shows <98.7% similarity to those of all known species, we can be fairly confident that it represents a species (= phylotype). There are >23 million 16S sequences in the INSDC database, but few meet these criteria. Phylotypes defined by 16S sequences alone come from sequencing either pure cultures or metagenomic libraries. In the EzBioCloud database, extreme care is taken in selecting the reference sequences that represent these phylotypes, using manual alignment and curation. In addition, over 2,000 phylotypes have now been added from >3 million reads generated by Pacific Biosciences (PacBio) long-read circular consensus sequencing (CCS). For example, PAC001304_s [See full taxonomy] is a phylotype belonging to the genus Prevotella and constitutes >31% of a deep-sequenced human fecal sample [Explore this sample]. In fact, four of the top ten species in this microbiome sample are phylotypes represented by EzBioCloud’s PacBio-based reference sequences.
Except for species with a valid name, none of the other cases (species with an invalid name, Candidatus, phylotype) has standing in the formal nomenclature. However, unique names or identifiers can still be assigned to all naturally existing species, which can greatly improve the taxonomic profiling of a microbiome, especially for large-scale comparisons. Please check the up-to-date statistics on the taxonomic database and diversity at https://www.ezbiocloud.net/dashboard.
The EzBioCloud team / Last edited on Feb. 19, 2018