Microbiome Basics

Microbiome bioinformatics


A microbiota is the entire collection of microorganisms in a specific niche, such as the human gut or soil. The microbiome comprises all of the genetic material within a microbiota. The most important methodology for studying microbiomes is metagenomics, which involves massive sequencing of DNA followed by sophisticated bioinformatics.

The goals of microbiome research are to understand (i) who the inhabitants are, (ii) what they do, and (iii) how they do it.

Goals of the microbiome study

To achieve these goals, we need to obtain the taxonomic and functional profiles of microbiome samples, then group and compare them to understand the differences. For example, we could identify bacterial species associated with obesity by comparing taxonomic profiles between groups of healthy and obese subjects. High-throughput DNA sequencing provides an accurate and efficient way to obtain these profiles.

Steps in Microbiome Bioinformatics

The above figure summarizes the major steps in microbiome studies. The process of bioinformatics can be divided into two steps: primary and secondary analyses.

Primary analysis in microbiome bioinformatics

In this step, large volumes of NGS reads are turned into light-weight profiles. For example, if 100,000 16S NGS reads match the sequence of the Vibrio cholerae type strain, the final profile will store only the count information for V. cholerae, i.e., 100,000, not the raw sequence data. Similarly, NGS sequences matched to a certain functional ortholog group, e.g., K00076, which is involved in secondary bile acid biosynthesis, will be stored in the functional profile with only the count information. A series of software tools, called a pipeline, is used to process raw NGS data in order to generate taxonomic or functional profiles.
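The reduction from reads to a count-only profile can be sketched as follows. This is a minimal illustration, not the EzBioCloud pipeline: it assumes each read has already been assigned a taxon name by some upstream classifier, and the function name `build_taxonomic_profile` and the sample assignments are hypothetical.

```python
from collections import Counter


def build_taxonomic_profile(read_assignments):
    """Collapse per-read taxon assignments into a count-only profile.

    `read_assignments` is an iterable of taxon names, one per read
    (assumed to come from an upstream classification step).
    The raw sequences are not kept; only the counts are.
    """
    return dict(Counter(read_assignments))


# Hypothetical assignments for eight reads.
assignments = (
    ["Vibrio cholerae"] * 5
    + ["Escherichia coli"] * 2
    + ["Vibrio cholerae"]
)
profile = build_taxonomic_profile(assignments)
print(profile)
```

A functional profile can be built the same way by passing ortholog identifiers (for example KEGG K numbers) instead of taxon names.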

Workflow in metagenomics

The most popular method of generating microbiome profiles is sequencing amplicons of phylogenetic markers. 16S and ITS are the marker genes of choice for bacteria and fungi, respectively. Amplicon sequencing is both cheap and sufficient to capture the taxonomic structure of microbiome samples. The drawback is that only taxonomic profiles can be obtained. To obtain functional profiles, shotgun sequencing should be used. Functional profiles can also be predicted from taxonomic profiles (see Langille et al., 2013), but the accuracy cannot be guaranteed. The following table illustrates the pros and cons of amplicon and shotgun metagenomics.

| | Amplicon sequencing | Shotgun sequencing |
| --- | --- | --- |
| Cost | Low | High |
| NGS platforms | Illumina, PacBio, Ion Torrent | Illumina, Ion Proton |
| Reference database | 16S database; most known species have been sampled | Genome database; <50% of known species have been sampled |
| Resolution | Species or taxonomic group | Subspecies, if reference database is available |
| Output | Taxonomic profiles | Taxonomic/functional profiles |
| Limitation | Low taxonomic resolution. No functional interpretation. | Taxonomic profiling may be wrong if the reference genome database has low taxonomic coverage (see Tessler et al., 2017) |

Secondary analysis in microbiome bioinformatics

Once taxonomic or functional profiles of microbiome samples have been generated using a pipeline and reference databases, they can be compared to find differentially present taxa or functional units, which are called biomarkers. Functional units may be orthologous groups or pathways. We call this secondary analysis because the sets of profiles can easily be swapped or changed for a new analysis. Because profiles are light-weight, most secondary analyses can be run instantly or within a reasonably short time (e.g., <20 sec).
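To make the idea of comparing light-weight profiles concrete, here is a minimal sketch that converts count profiles to relative abundances and ranks taxa by the difference in mean abundance between two groups. It is only a descriptive comparison under simplifying assumptions; real biomarker discovery relies on statistical testing, which is not shown. All function names and the sample data are hypothetical.

```python
def relative_abundance(profile):
    """Convert a {taxon: count} profile into {taxon: fraction}."""
    total = sum(profile.values())
    if total == 0:
        return {taxon: 0.0 for taxon in profile}
    return {taxon: count / total for taxon, count in profile.items()}


def mean_abundance(profiles):
    """Mean relative abundance per taxon across a group of profiles."""
    if not profiles:
        raise ValueError("a group must contain at least one profile")
    rel = [relative_abundance(p) for p in profiles]
    taxa = set().union(*(r.keys() for r in rel))
    # A taxon absent from a sample counts as abundance 0 in that sample.
    return {t: sum(r.get(t, 0.0) for r in rel) / len(rel) for t in taxa}


def abundance_differences(group_a, group_b):
    """Per-taxon difference of mean relative abundance (group_a - group_b)."""
    mean_a = mean_abundance(group_a)
    mean_b = mean_abundance(group_b)
    taxa = set(mean_a) | set(mean_b)
    return {t: mean_a.get(t, 0.0) - mean_b.get(t, 0.0) for t in taxa}


# Hypothetical count profiles for two groups of two samples each.
healthy = [{"Taxon A": 80, "Taxon B": 20}, {"Taxon A": 60, "Taxon B": 40}]
obese = [{"Taxon A": 30, "Taxon B": 70}, {"Taxon A": 50, "Taxon B": 50}]

diffs = abundance_differences(healthy, obese)
for taxon, diff in sorted(diffs.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{taxon}: {diff:+.2f}")
```

Because only small dictionaries of counts are involved, swapping in a different set of profiles and rerunning the comparison is essentially instantaneous.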

A web-based secondary analysis platform is a powerful tool, as it enables an instant and interactive process for biomarker discovery. The EzBioCloud 16S-based MTP app is designed to provide an optimized means of secondary analysis with versatile visualizations and publication-ready reports.

The EzBioCloud team / Last edited on Feb. 19, 2018

Bacterial Taxonomy 101

Understanding the terminology, basic concepts, and practices of bacterial taxonomy is a prerequisite for proper microbiome analysis. Here, we outline the most important concepts, terms, and practices pertaining to microbiome bioinformatics. We refer to “Bacteria” throughout, but these ideas extend to “Archaea” as well, as both use the same principles and are governed by a single formal process.

Three processes of bacterial taxonomy

Bacterial taxonomy consists of three basic processes:

  1. Classification: There are a great many bacterial strains living on earth. The goal of classification is to organize these strains into groups. Our basic unit is the “species”, and our initial attempt should be to find the groups of strains corresponding to species. Once we define one or more species, we can further define larger groups based on their phenotypic properties and phylogeny. For example, a genus consists of one or more species; a family comprises one or more genera, etc. Eventually, our goal is to build a hierarchical system of groups of naturally existing strains.
  2. Nomenclature: Once we identify the groups (from species to phyla), we need to name them properly. There are rules for proposing and approving names of bacterial taxa. Since there is no government or parliament for the science of taxonomy, bacterial taxonomists established official bodies and processes to control bacterial names. This is important to avoid unnecessary confusion and to maximize the efficiency of supporting other scientific fields that use taxonomy (e.g., clinical diagnostics and ecology).
  3. Identification: Newly isolated strains can be assigned to known species using various methods; this is called identification. If we cannot assign new strains to known species because they belong to hitherto unknown species, we say that they cannot be identified. For the new species, we go back to classification and then nomenclature. Therefore, the three processes of bacterial taxonomy are engaged iteratively. In fact, >1,000 new species of Bacteria and Archaea are being described each year.

The basic unit for taxonomic profiling

The basic unit for taxonomic profiling of a microbiome is the species. Because the results of metagenomic analysis contain data derived from uncultured (not unculturable) organisms and otherwise complex mixtures of organisms, we must define another term. The Operational Taxonomic Unit (OTU) is often used for this purpose. The terms species and OTU are often used interchangeably, but they should be distinguished.

  • Species: a basic taxonomic unit that is defined by a formal process (outlined later) according to classical Linnaean taxonomy.
  • OTU: Operational Taxonomic Unit. A cluster of sequences or organisms that is defined purely by the sequence similarity of DNA barcode molecules. In bacterial microbiome studies, an OTU is typically defined as a cluster of 16S sequences showing at least 97% sequence similarity. This type of OTU is also called a molecular OTU.
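The OTU definition above can be illustrated with a greedy, centroid-based clustering sketch. This is a simplified stand-in for real OTU-picking tools: it assumes the sequences are already aligned to equal length and measures similarity as the fraction of identical positions, whereas production tools compute pairwise alignments. The function names and toy sequences are hypothetical.

```python
def identity(seq_a, seq_b):
    """Fraction of identical positions between two pre-aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches / len(seq_a)


def greedy_otu_clustering(sequences, cutoff=0.97):
    """Assign each sequence to the first OTU whose centroid it matches.

    Returns a list of (centroid, members) pairs. A sequence that matches
    no existing centroid at `cutoff` or higher founds a new OTU.
    """
    otus = []
    for seq in sequences:
        for centroid, members in otus:
            if identity(seq, centroid) >= cutoff:
                members.append(seq)
                break
        else:
            otus.append((seq, [seq]))
    return otus


# Toy 100-base "sequences" (not real 16S data).
base = "ACGT" * 25
near = "TT" + base[2:]        # 2 mismatches  -> 98% identical to base
far = "T" * 12 + base[12:]    # 9 mismatches  -> 91% identical to base

otus = greedy_otu_clustering([base, near, far])
print(len(otus), [len(members) for _, members in otus])
```

Note that the result of greedy clustering depends on the input order, which is one reason OTUs are operational units rather than formally defined species.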

The goal of microbiome research is to understand the taxonomic profiles of different samples at the species level or lower. Because of the limited resolution of the 16S barcode, obtaining information at the subspecies level is impossible. However, with carefully designed sequencing and bioinformatics, species-level profiling can be achieved. The OTU serves well as a basic unit of taxonomic profiling in place of species when a reference database is not well established.

A single OTU may contain multiple species. Take, for example, Bacillus cereus, a well-known soil bacterial species. If the OTU concept with a 97% cutoff is applied, twenty-three species would collectively be considered a single OTU, as their 16S sequence similarities are >98% (see the table below; data generated from //www.ezbiocloud.net/identify using the sequence of the B. cereus type strain).

Table. 16S sequence similarities of some Bacillus species against the type strain of Bacillus cereus. Please note that CP013274_s and JH792383_s are genomospecies.


How are bacterial species defined?

This blog explains in detail how a bacterial species is defined [Learn more]. Here, we outline the essence of currently acknowledged bacterial taxonomy.

  • A bacterial species should have a designated type strain (nomenclatural type) which is a live microorganism. It should be available to anyone who wants to study it. Usually, type strains can be obtained from public or private/for-profit culture collections.
  • The modern species concept tries to adopt genomics into practice. The taxonomically accepted measure of similarity between two genome sequences is Average Nucleotide Identity (ANI), which is calculated using a series of bioinformatic algorithms [Learn more]. See Chun et al. (2014, 2018) for how to apply ANI to bacterial taxonomy. The proposed ANI cutoff is 95-96%. If a bacterial strain shows an ANI at or above this cutoff to the type strain of species A, it is assigned to species A. If we had a reference genome database containing all species on earth, the ANI-based approach would serve as a perfect platform. However, we do not have these data in hand: not all type strains have been sequenced, and there are more uncultured species than cultured ones so far.
  • The 16S sequence is still widely utilized in bacterial taxonomy, although it is used a bit differently from genome sequences. A recent study showed that 98.7% can be used as a cutoff for recognizing species. Again, the comparison should be made against the type strain for taxonomic purposes.
    • If a strain shows a 16S similarity of 98.7% or lower to the type strain of species A, it does not belong to species A.
    • Otherwise, the strain may or may not belong to species A. The higher the similarity, the greater the chance that it is a member of species A. However, in exceptional cases, even two strains with 100% identical 16S sequences can show <95% ANI, meaning that they belong to different species.
    • The 98.7% cutoff should be used only when sequencing errors are minimal. Therefore, applying a 97% cutoff to define OTUs is reasonable when single-pass NGS sequences are considered.
  • In conclusion, the combination of 16S and ANI similarities can be used for the classification and identification of bacteria.
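The combined use of the two cutoffs described in the list above can be sketched as a small decision function. This is an illustration of the stated rules only, not an EzBioCloud implementation: the function name and return labels are hypothetical, both similarities are assumed to be measured against the type strain of the candidate species, and 95% is used as the lower end of the 95-96% ANI range (values inside that range are borderline in practice).

```python
ANI_SPECIES_CUTOFF = 95.0   # assumption: lower end of the 95-96% range
S16_SPECIES_CUTOFF = 98.7   # 16S similarity cutoff quoted in the text


def assign_to_species(s16_similarity, ani=None):
    """Decide species membership from 16S similarity and optional ANI.

    Both values are percentages measured against the type strain of the
    candidate species. `ani` may be None when no genome is available.
    """
    # A 16S similarity at or below the cutoff excludes the species.
    if s16_similarity <= S16_SPECIES_CUTOFF:
        return "different species"
    # Above the cutoff, 16S alone cannot confirm membership.
    if ani is None:
        return "undetermined"
    # ANI settles the question when a genome sequence is available.
    if ani >= ANI_SPECIES_CUTOFF:
        return "same species"
    return "different species"


print(assign_to_species(97.0))             # excluded by 16S alone
print(assign_to_species(99.5))             # 16S is inconclusive
print(assign_to_species(100.0, ani=94.0))  # identical 16S, low ANI
print(assign_to_species(99.5, ani=97.0))   # confirmed by ANI
```

The third call mirrors the exceptional case mentioned in the list, where identical 16S sequences still correspond to different species.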

Bacterial species types used for microbiome study

Several types of species or similar terms can be defined and used for microbiome analysis.

  • Species with a valid name: This is the standard type of species. The description of the species is published, and the type strain is deposited in one or more culture collections. At present, the only conditions for validating the name of a species are (i) publication in any journal and (ii) deposition of the type strain in two culture collections in two different countries. The scientific community regulates nomenclature (the process of naming), but not taxonomy itself. Therefore, it is the name, not the species, that is validated or said to be “valid”, and the term “valid species” is not correct. If you are interested in how bacterial names are regulated, consult the International Code of Nomenclature of Prokaryotes. An example is Escherichia coli.
  • Species with an invalid name: A species with an invalid name is similar to a species with a valid name, except that its name is not listed on the Approved List. This list of formally recognized names is published in the International Journal of Systematic and Evolutionary Microbiology (IJSEM). There are two types of list. The Notification List contains approved names that were published in IJSEM; an IJSEM editor compiles it automatically. If a paper describing a new species or any other taxonomic change was published in a journal other than IJSEM, the authors should submit a reprint to IJSEM. By doing so, the name of the new or changed taxon is included in the Validation List. Here are example articles of the Notification and Validation Lists. The main reasons for a name being invalid are:
    • The type strain was not deposited in two different culture collections in two different countries, so it does not meet the condition for validation.
    • The (effective) publication was not yet submitted to IJSEM for validation.
  • Candidatus: Candidatus means a “candidate species”. The concept of Candidatus was first introduced by Murray & Stackebrandt (1995). It is not part of the formal nomenclature, so the name itself is not italicized: only the word “Candidatus” is set in italics, and the name that follows (e.g., Carsonella ruddii) is set in roman type. Candidatus names are usually given to candidate species that cannot be cultivated as pure cultures. Typical cases are the prokaryotic obligate endosymbionts of animals and plants, such as Candidatus Carsonella ruddii.
  • Phylotype: In many cases, we know that a species exists but lack the supporting data to validate its name. We call such a species a “phylotype”. Here are typical cases of phylotypes used in the EzBioCloud database.
    • Genomospecies: A genomospecies deserves species status and is supported by genomic data (e.g., ANI). However, it has never been named, so the EzBioCloud team assigns it a unique name, usually derived from the accession numbers of INSDC databases. For example, the phylotype CP013274_s is represented by a genome sequence that was deposited in INSDC as a strain of Bacillus thuringiensis but showed <95% ANI to all known Bacillus species. Therefore, it is assigned as a new phylotype (equivalent to a species) [Learn more]. A genomospecies in the EzBioCloud database is always represented by an accurate 16S sequence.
    • 16S phylotype: If a 16S sequence is accurate, full-length, and shows <98.7% similarity to those of all known species, we can be fairly confident that it represents a species (= phylotype). There are >23 million 16S sequences in the INSDC database, but not many meet these criteria. Phylotypes defined by 16S sequences alone come from sequencing either pure cultures or metagenomic libraries. In the EzBioCloud database, extreme care is taken in selecting the reference sequences that represent these phylotypes, using manual alignment and curation. In addition, over 2,000 phylotypes have now been added from >3 million reads generated by Pacific Biosciences (PacBio) long-read CCS sequencing. For example, PAC001304_s [See full taxonomy] is a phylotype belonging to the genus Prevotella and constitutes >30% of a deep-sequenced human fecal sample [Explore this sample]. In fact, four of the top ten species in this microbiome sample are phylotypes represented by EzBioCloud’s PacBio-based reference sequences.

Except for the species with a valid name, all other cases (species with an invalid name, Candidatus, phylotype) do not have a standing in the formal nomenclature. However, we could assign unique names or identifiers to all naturally existing species, which can greatly improve the taxonomic profiling of a microbiome, especially for large-scale comparisons. Please check the up-to-date statistics on the taxonomic database and diversity at //www.ezbiocloud.net/dashboard.

The EzBioCloud team / Last edited on Feb. 19, 2018

16S copy number correction

What is the 16S copy number and why does it matter?

The 16S rRNA gene (16S) has been widely used as a phylogenetic marker and is particularly important for the taxonomic profiling of microbiome samples. Unlike most protein-coding genes, the 16S rRNA gene may be present in multiple copies in a single cell. A bacterial strain must have at least one gene encoding 16S, but the copy number can go up to 15 (see the chart below). There is a positive correlation between genome size and 16S copy number.

16S copy numbers of bacteria in EzBioCloud database (generated from only complete genomes).

When we analyze microbiome data using 16S amplicon sequences, all quantitative measures are based on NGS read counts assigned to known taxa. In other words, we actually count the copies of a marker gene, typically 16S, present in a microbiome sample. However, what we eventually want to know is not the number of 16S reads but the number of corresponding cells, or CFU (colony-forming units).

Let’s assume that we are analyzing a human fecal sample. After sequencing, we obtain 100 reads assigned to Bacteroides fragilis and 100 assigned to Prevotella copri. Both species are frequently found in the human gut. Should we say that the two species are present in equal numbers? According to the EzBioCloud database, which provides information about 16S copy number, B. fragilis has 6 copies whereas P. copri has 4 copies [Learn more for B. fragilis and P. copri]. Dividing each read count by the copy number gives 100/6 ≈ 16.7 cell equivalents for B. fragilis and 100/4 = 25 for P. copri, so the corrected ratio between B. fragilis and P. copri should be 2:3, not 1:1. The necessity of 16S copy number correction or normalization has been raised by several studies (Kembel et al., 2012; Angly et al., 2014; Vandeputte et al., 2017).

How to correct taxonomic profile data using 16S copy numbers

The relative taxonomic composition of a microbiome sample can be corrected by a simple calculation once we know the 16S copy numbers of all species. The problem is that we do not know these values for all species. To obtain accurate data, at least one complete genome sequence is required for each species. Incomplete genome assemblies derived from short NGS reads contain either no 16S gene sequences or an inaccurate number of them. At present, 3,467 species are represented by complete genome sequences (as of Dec. 2017). The 16S copy numbers of the remaining species, including uncultured phylotypes, have to be predicted from the existing data.
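The simple calculation referred to here can be sketched as follows: divide each taxon's read count by its 16S copy number, then renormalize so the corrected values sum to 1. This is a minimal illustration, not the MTP app's code; the function name is hypothetical, and the copy numbers are the example values quoted in this article (6 for B. fragilis, 4 for P. copri).

```python
def correct_for_copy_number(read_counts, copy_numbers):
    """Return copy-number-corrected relative abundances.

    read_counts:  {taxon: number of 16S reads assigned to the taxon}
    copy_numbers: {taxon: 16S gene copies per genome}
    """
    cell_equivalents = {}
    for taxon, reads in read_counts.items():
        if taxon not in copy_numbers:
            # In practice a predicted value would be used here instead.
            raise KeyError(f"no 16S copy number available for {taxon}")
        cell_equivalents[taxon] = reads / copy_numbers[taxon]

    total = sum(cell_equivalents.values())
    if total == 0:
        return {taxon: 0.0 for taxon in cell_equivalents}
    return {taxon: value / total for taxon, value in cell_equivalents.items()}


reads = {"Bacteroides fragilis": 100, "Prevotella copri": 100}
copies = {"Bacteroides fragilis": 6, "Prevotella copri": 4}

corrected = correct_for_copy_number(reads, copies)
for taxon, fraction in corrected.items():
    print(f"{taxon}: {fraction:.2f}")
```

With equal read counts, the species with more 16S copies ends up with the smaller corrected share (0.40 versus 0.60 here), because each of its cells contributes more reads.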

A couple of algorithms have been proposed to predict the missing 16S copy numbers (Langille et al., 2013; Angly et al., 2014). In the EzBioCloud 16S-based MTP app, the PICRUSt algorithm (Langille et al., 2013) is used to generate the 16S copy number database for all species/phylotypes in the EzBioCloud 16S database (see the figure below).

Prediction of 16S copy numbers using PICRUSt algorithm

Implementation in EzBioCloud 16S-based MTP app

The EzBioCloud 16S-based MTP app allows you to instantly and interactively apply 16S copy number correction to the comparative analysis of multiple samples, the calculation of beta-diversity, and biomarker discovery (e.g., LEfSe). Our database of 16S copy numbers is more comprehensive than any other database, as we utilize an up-to-date version of the genome database (8,631 quality-controlled genomes of 3,302 species; as of March 2018).

The EzBioCloud team / Last edited on Mar. 6, 2018

