Microbiome Taxonomic Profiling Pipeline used in EzBioCloud

The primary objective of analyzing 16S amplicon sequencing data is to profile sequencing reads into a known taxonomic structure. The figure below illustrates the current bioinformatics pipeline used in the EzBioCloud 16S-based MTP app.

Raw data that is uploaded to the pipeline can be any one of the following:

  • Single-end reads generated by Roche 454
  • Single-end or paired-end reads generated by Illumina platforms (MiSeq, HiSeq, etc.)
  • CCS (circular consensus sequencing) reads generated by Pacific Biosciences (PacBio) platforms. Please consult PacBio’s manual for how to generate CCS reads.
  • Reads from any other NGS platform that can generate FASTQ or FASTA format output.

[Figure: Pipeline for Microbiome Taxonomic Profiling used in EzBioCloud]

Merging paired-end reads:
 In the case of paired-end sequencing (typically MiSeq 250 bp x 2), the two reads representing each end of the same PCR amplicon are merged using their overlapping sequence. Read pairs that cannot be merged are omitted from the subsequent steps. The VSEARCH program (Rognes et al. 2016) is used for merging. For single-end or CCS sequencing, this step is not required.
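
To make the idea of overlap-based merging concrete, here is a minimal Python sketch. It is not how VSEARCH is implemented (VSEARCH also uses base qualities and a statistical model); the function names, `min_overlap` and `max_mismatch_frac` are assumptions chosen for illustration only.

```python
# Illustrative only: a naive overlap merger, not the VSEARCH algorithm.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(forward, reverse, min_overlap=10, max_mismatch_frac=0.1):
    """Merge a read pair by its overlap; return None if the pair cannot be merged.

    min_overlap and max_mismatch_frac are hypothetical parameters.
    """
    rc = reverse_complement(reverse)
    # Try the longest possible overlap first, then shorter ones.
    for overlap in range(min(len(forward), len(rc)), min_overlap - 1, -1):
        tail = forward[-overlap:]   # 3' end of the forward read
        head = rc[:overlap]         # 5' end of the reverse-complemented mate
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatch_frac * overlap:
            return forward + rc[overlap:]
    return None  # pair is dropped from the subsequent steps
```

A pair for which `merge_pair` returns `None` corresponds to the reads that are omitted from the rest of the pipeline.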
Trimming primers:
 When PCR amplicons are sequenced, the primer regions are not considered “sequenced”: the bases in these regions come from the primers that annealed during PCR, not from the template DNA. Therefore, our pipeline removes the primer sequences that were used for PCR of the 16S gene. In-house code is used for this step.
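
The in-house trimming code is not published, so the following Python sketch only illustrates the general technique: match a (possibly degenerate) primer at the start of a read and cut it off. The function names and the `max_mismatches` tolerance are assumptions; the primer in the usage note is the widely used 341F primer, shown purely as an example.

```python
# Illustrative only: not the EzBioCloud in-house trimming code.
# IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def matches_primer(read_prefix, primer, max_mismatches=2):
    """True if read_prefix matches the degenerate primer within the tolerance."""
    if len(read_prefix) < len(primer):
        return False
    mismatches = sum(
        base not in IUPAC[code] for base, code in zip(read_prefix, primer)
    )
    return mismatches <= max_mismatches


def trim_forward_primer(read, primer, max_mismatches=2):
    """Return the read without its 5' primer, or None if no primer is found.

    Reads and primers are assumed to be uppercase.
    """
    if matches_primer(read[:len(primer)], primer, max_mismatches):
        return read[len(primer):]
    return None
```

For example, `trim_forward_primer("CCTACGGGAGGCAGCAGTGGGGAATATTG", "CCTACGGGNGGCWGCAG")` strips the 17-base primer and keeps the remainder of the read.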
Filtering by quality:
 Even though present-day NGS machines produce high-quality sequences, low-quality sequences are still generated. We apply several measures to detect and filter out these low-quality sequences [Learn more].
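
The specific measures are described on the linked page and are not reproduced here. As one generic illustration of a quality filter, the Python sketch below computes the expected number of errors in a read from its Phred quality string. The Phred+33 encoding and the `max_expected_errors` threshold are assumptions, and this is not necessarily one of the measures used by the pipeline.

```python
# Illustrative only: a common generic quality measure, assumed for this example.
def expected_errors(quality_string, offset=33):
    """Sum of per-base error probabilities from a Phred-encoded quality string."""
    return sum(10 ** (-(ord(ch) - offset) / 10) for ch in quality_string)


def passes_quality(quality_string, max_expected_errors=1.0):
    """Keep a read only if its expected error count is within the threshold.

    max_expected_errors is a hypothetical cutoff.
    """
    return expected_errors(quality_string) <= max_expected_errors
```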
Extracting non-redundant reads:
Identical sequences are dereplicated (collapsed into a single representative sequence) in this step to reduce computational time.
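
Dereplication can be sketched in a few lines of Python. The function name is hypothetical; the abundance of each unique sequence is kept so that read counts are not lost.

```python
from collections import Counter


def dereplicate(reads):
    """Collapse identical reads into (sequence, abundance) pairs.

    Pairs are returned with the most abundant sequence first.
    """
    return Counter(reads).most_common()
```

Only the unique sequences then need to be searched against the reference database, while the stored abundances are carried through to later steps.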
Taxonomic assignment:
Dereplicated sequences are then subjected to taxonomic assignment. We use the VSEARCH program (Rognes et al. 2016) to search the query NGS reads against the EzBioCloud 16S database and calculate their sequence similarities. A 16S similarity of 97% is used as the cutoff for species-level identification. Lower sequence similarity cutoffs are used for genus and higher taxonomic ranks:

  • x = sequence similarity to reference sequences: species (x ≥ 97%), genus (97% > x ≥ 94.5%), family (94.5% > x ≥ 86.5%), order (86.5% > x ≥ 82%), class (82% > x ≥ 78.5%), and phylum (78.5% > x ≥ 75%). Cutoff values are taken from Yarza et al. (2014).
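
The cutoffs above translate directly into a lookup. The following Python sketch (the names are hypothetical) returns the lowest taxonomic rank that a given percent similarity supports.

```python
# Cutoffs from the list above (Yarza et al. 2014), highest resolution first.
RANK_CUTOFFS = [
    ("species", 97.0),
    ("genus", 94.5),
    ("family", 86.5),
    ("order", 82.0),
    ("class", 78.5),
    ("phylum", 75.0),
]


def lowest_rank(similarity):
    """Return the lowest rank supported by a percent similarity, or None."""
    for rank, cutoff in RANK_CUTOFFS:
        if similarity >= cutoff:
            return rank
    return None  # below 75%: no rank can be assigned
```

For instance, a read whose best hit has 96.9% similarity is identified at the genus level only, while a 97.0% hit is identified at the species level.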

To reduce computation and improve accuracy, we built different versions of the reference 16S database that match various regions of the 16S sequence. For example, the full-length version (V1-V9) is used for PacBio CCS data, whereas the V3-V4 version is used for MiSeq 250 bp paired-end sequencing data.

Detecting chimeras:
We assume that NGS sequencing reads that match the reference sequences in the EzBioCloud database are not chimeric. Only the remaining reads are checked for chimeras using the UCHIME program [Learn more].
Picking OTUs:
OTU (operational taxonomic unit) is a widely used term in microbiome research and can be regarded as a “species” [Learn more]. All sequences from a sample can be clustered into OTUs using different algorithms and software tools. Rideout et al. (2014) evaluated three algorithms (de novo, closed-reference and open-reference). The EzBioCloud 16S-based MTP pipeline adopts the “open-reference” method with the following four steps:

  1. All quality-controlled query sequences are matched against the EzBioCloud 16S database to achieve species-level identification (97% cutoff).
  2. The sequences that are not matched at 97% are then clustered using the UCLUST tool with a 97% similarity threshold. Each resulting cluster is defined as an OTU.
  3. The species identified in step 1 and the OTUs obtained in step 2 are combined to become the final set of OTUs. This information is later used for calculating alpha diversity indices.
  4. Any remaining singletons are ignored in the OTU picking process. This is particularly important for Illumina short reads, which may over-estimate the number of OTUs [Learn more].
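
The four steps above can be sketched in Python as follows. This is a toy illustration under stated assumptions: `identity` is a simplistic stand-in for a real alignment-based similarity (it only compares pre-aligned, equal-length sequences), the greedy clustering loop is a simplification in the spirit of centroid-based tools and is not UCLUST itself, and all function and label names are hypothetical.

```python
# Illustrative only: a toy open-reference OTU picker.
def identity(a, b):
    """Toy percent identity for two pre-aligned, equal-length sequences."""
    if len(a) != len(b) or not a:
        return 0.0
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def pick_otus(read_counts, references, cutoff=97.0):
    """read_counts: {sequence: abundance}; references: {species name: sequence}.

    Returns {OTU label: total read count}.
    """
    otus = {}
    unmatched = []

    # Step 1: species-level identification against the reference database.
    for seq, count in read_counts.items():
        best_name, best_id = None, 0.0
        for name, ref_seq in references.items():
            pid = identity(seq, ref_seq)
            if pid > best_id:
                best_name, best_id = name, pid
        if best_name is not None and best_id >= cutoff:
            otus[best_name] = otus.get(best_name, 0) + count
        else:
            unmatched.append((seq, count))

    # Step 2: greedy clustering of unmatched reads, most abundant first.
    unmatched.sort(key=lambda item: -item[1])
    clusters = []  # each entry is [centroid sequence, total read count]
    for seq, count in unmatched:
        for cluster in clusters:
            if identity(seq, cluster[0]) >= cutoff:
                cluster[1] += count
                break
        else:
            clusters.append([seq, count])

    # Steps 3 and 4: combine both sets, ignoring singleton clusters.
    kept = [c for c in clusters if c[1] > 1]
    for index, (_centroid, total) in enumerate(kept, start=1):
        otus[f"denovo_OTU_{index}"] = total
    return otus
```

The returned dictionary holds both reference-based species and de novo clusters, which is the combined OTU set that the diversity calculations operate on.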
Estimating alpha diversity indices:
Using OTU information (number of OTUs and sequences in each OTU), various alpha diversity indices can be calculated. These include species richness, Shannon and Simpson diversity indices, and many more.
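
As a minimal sketch of how such indices follow from OTU counts, the Python functions below compute species richness (observed OTUs), the Shannon index with natural logarithms, and the Simpson index in its 1 − Σp² form. Which exact variants and estimators the MTP app reports is not stated here, so these particular formulas are assumptions.

```python
import math


def richness(counts):
    """Number of OTUs with at least one read (observed richness)."""
    return sum(1 for c in counts if c > 0)


def shannon(counts):
    """Shannon index H = -sum(p * ln p) over OTU proportions."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def simpson(counts):
    """Simpson index in the 1 - sum(p^2) form."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)
```

For four equally abundant OTUs, the Shannon index equals ln 4 and this form of the Simpson index equals 0.75.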
Secondary analysis using EzBioCloud 16S-based MTP app:
Once all calculations are carried out for a single microbiome sample in the EzBioCloud 16S-based MTP pipeline, all the information about that sample is saved as an object named Microbiome Taxonomic Profile (MTP). The EzBioCloud 16S-based MTP app is installed on the Amazon Cloud, and you use the EzBioCloud web-based user interface to run comparative analyses and data mining on sets of MTPs of your own choice. This process is called “secondary analysis”. Typical secondary analyses require only a few mouse clicks, and the results are available in seconds.

The EzBioCloud team / Last edited on Feb. 19, 2018