Microbiome Taxonomic Profiling Pipeline used in EzBioCloud
The primary objective of analyzing 16S amplicon sequencing data is to profile sequencing reads into a known taxonomic structure. The figure below illustrates the current bioinformatics pipeline used in the EzBioCloud 16S-based MTP app.
Raw data that is uploaded to the pipeline can be any one of the following:
- Single-end reads generated by Roche 454
- Single-end or paired-end reads generated by Illumina platforms (MiSeq, HiSeq, etc.)
- CCS (circular consensus sequencing) reads generated by Pacific Biosciences (PacBio) platforms. Please consult PacBio’s manual for how to generate CCS reads.
- Reads from any other NGS platform that can generate FASTQ or FASTA output.
|Merging paired-end reads:|
|In paired-end sequencing (typically MiSeq 2 x 250 bp), the two reads that represent opposite ends of the same PCR amplicon are merged using their overlapping sequence. This step is not required for single-end or CCS reads. Read pairs that cannot be merged are omitted from the subsequent steps. Where merging applies, PANDAseq (Masella et al., 2012) is used.|
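As a rough illustration of overlap-based merging, the sketch below reverse-complements the reverse read, looks for the longest exact overlap between the end of the forward read and the start of the reverse-complemented read, and joins the two. This is a deliberately simplified stand-in, not PANDAseq's algorithm, which scores overlaps using base qualities. The function names and the `min_overlap` parameter are assumptions made for this example.

```python
from typing import Optional

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(forward: str, reverse: str, min_overlap: int = 10) -> Optional[str]:
    """Merge a read pair on its longest exact overlap.

    Returns the merged sequence, or None when no overlap of at least
    `min_overlap` bases exists (such pairs would be dropped).
    """
    rc = reverse_complement(reverse)
    longest = min(len(forward), len(rc))
    # Try the longest candidate overlap first, then shrink.
    for size in range(longest, min_overlap - 1, -1):
        if forward[-size:] == rc[:size]:
            return forward + rc[size:]
    return None
```

Exact matching is used only to keep the example short; real reads contain errors, so a production merger has to tolerate mismatches in the overlap.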
|When PCR amplicons are sequenced, the primer regions are not regarded as genuinely “sequenced”: their bases come from the primers that annealed during PCR rather than from the template DNA. Our pipeline therefore removes the primer sequences used to amplify the 16S gene. In-house code is used for this step.|
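The in-house trimming code is not published here, so the following is only a minimal sketch of how a 5' primer could be removed. It assumes the primer sits at the very start of the read, allows IUPAC degenerate bases in the primer, and tolerates a small number of mismatches. The function name `trim_forward_primer` and the `max_mismatches` parameter are hypothetical.

```python
from typing import Optional

# IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def trim_forward_primer(read: str, primer: str, max_mismatches: int = 2) -> Optional[str]:
    """Strip `primer` from the 5' end of `read`.

    Returns the read without the primer, or None when the read is
    shorter than the primer or differs from it at too many positions.
    """
    if len(read) < len(primer):
        return None
    mismatches = sum(
        base not in IUPAC[code] for base, code in zip(read, primer)
    )
    if mismatches > max_mismatches:
        return None
    return read[len(primer):]
```

The reverse primer could be handled the same way on the reverse-complemented read.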
|Filtering by quality:|
|Although present-day NGS machines produce high-quality sequences, low-quality sequences are still generated. We apply several measures to detect and filter out low-quality sequences [Learn more].|
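The specific quality measures are described on the linked page rather than here, so the sketch below only shows the general shape of such a filter. It assumes FASTQ input with Phred+33 quality encoding, and the three criteria (minimum length, maximum number of ambiguous bases, minimum mean quality) and their default values are hypothetical, not the pipeline's actual settings.

```python
def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a FASTQ quality string (Phred+33 by default)."""
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def passes_quality(
    seq: str,
    quality: str,
    min_mean_q: float = 25.0,   # hypothetical threshold
    min_length: int = 100,      # hypothetical threshold
    max_n: int = 0,             # hypothetical threshold
) -> bool:
    """Return True when a read passes all three illustrative checks."""
    if not quality or len(seq) != len(quality):
        return False
    if len(seq) < min_length:
        return False
    if seq.upper().count("N") > max_n:
        return False
    return mean_phred(quality) >= min_mean_q
```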
|Denoising and extracting non-redundant reads:|
|In general, raw NGS data contain ~0.5% sequencing errors, which occur randomly. Because the same gene is sequenced many times over in microbiome sequencing, these errors can be corrected with adequate error modeling. This process is called “denoising”, and we use new software called DUDE-Seq for it. Identical sequences are also dereplicated in this step to reduce computational time.|
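Dereplication itself is simple to illustrate: identical sequences are collapsed into one representative with an abundance count, so that later steps process each unique sequence once. The sketch below shows only this part and does not attempt to reproduce DUDE-Seq denoising. The function name `dereplicate` is an assumption for this example.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def dereplicate(reads: Iterable[str]) -> List[Tuple[str, int]]:
    """Collapse identical reads into (sequence, abundance) pairs.

    Comparison is case-insensitive; the result is sorted by
    decreasing abundance.
    """
    counts = Counter(read.upper() for read in reads)
    return counts.most_common()
```

Keeping the abundance of each unique sequence matters because the counts are needed again later, for example when computing diversity indices.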
|Denoised and dereplicated sequences are then subjected to taxonomic assignment. We use the VSEARCH program to search the query NGS reads against the EzBioCloud 16S database and to calculate their sequence similarities. A 16S similarity of 97% is used as the cutoff for species-level identification. Other sequence similarity cutoffs are used for genus and higher taxonomic ranks.
To reduce computation and improve accuracy, we built different versions of the reference 16S database that match various regions of the 16S sequence. For example, the full-length version (V1-V9) is used for PacBio CCS data, whereas the V3-V4 version is used for MiSeq 250 bp paired-end sequencing data.|
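To make the 97% rule concrete, the sketch below computes a percent identity from two already-aligned sequences and applies the species-level cutoff. It is an illustration only: VSEARCH performs its own alignment and offers several identity definitions, and the counting rule used here (matches divided by alignment columns, gaps counted as mismatches) is an assumption, as is treating the cutoff as inclusive. The cutoffs for genus and higher ranks are not given in the text and are therefore left out.

```python
SPECIES_CUTOFF = 97.0  # percent similarity, as stated in the text


def percent_identity(aligned_query: str, aligned_ref: str) -> float:
    """Percent identity over the columns of a pairwise alignment.

    Both inputs must have equal length and use '-' for gaps.
    Columns that are gaps in both sequences are ignored.
    """
    if len(aligned_query) != len(aligned_ref):
        raise ValueError("aligned sequences must have the same length")
    columns = [
        (q, r) for q, r in zip(aligned_query, aligned_ref)
        if not (q == "-" and r == "-")
    ]
    if not columns:
        return 0.0
    matches = sum(q == r for q, r in columns)
    return 100.0 * matches / len(columns)


def is_species_level_hit(similarity_percent: float) -> bool:
    """True when the similarity reaches the species-level cutoff."""
    return similarity_percent >= SPECIES_CUTOFF
```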
|We assume that sequencing reads that match reference sequences in the EzBioCloud database are not chimeric. Only the remaining reads are checked for chimeras using the UCHIME program [Learn more].|
|OTU (operational taxonomic unit) is a widely used term in microbiome research and can be regarded as “species” [Learn more]. All sequences from a sample can be clustered into OTUs using different algorithms and software tools. Rideout et al. (2014) evaluated three approaches (de novo, closed-reference and open-reference). The EzBioCloud 16S-based MTP pipeline adopts the “open-reference” method with the following three steps:
|Estimating alpha diversity indices:|
|Using OTU information (number of OTUs and sequences in each OTU), various alpha diversity indices can be calculated. These include species richness, Shannon and Simpson diversity indices, and many more.|
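The indices named above follow directly from the per-OTU read counts. The sketch below is one possible formulation: richness is taken as the number of observed OTUs, Shannon diversity uses the natural logarithm, and the Simpson value is reported as the sum of squared proportions. Which variants and estimators the MTP app actually reports is not stated here, so these choices, and the function name `alpha_diversity`, are assumptions.

```python
import math
from typing import Dict, Sequence


def alpha_diversity(otu_counts: Sequence[int]) -> Dict[str, float]:
    """Compute simple alpha diversity indices from per-OTU read counts.

    richness: number of OTUs with at least one read
    shannon:  H = -sum(p_i * ln(p_i))
    simpson:  D = sum(p_i ** 2)
    """
    counts = [c for c in otu_counts if c > 0]
    total = sum(counts)
    if total == 0:
        raise ValueError("at least one OTU must contain reads")
    proportions = [c / total for c in counts]
    shannon = -sum(p * math.log(p) for p in proportions)
    simpson = sum(p * p for p in proportions)
    return {"richness": len(counts), "shannon": shannon, "simpson": simpson}
```

For a perfectly even sample of four OTUs, this gives a richness of 4, a Shannon index of ln 4 and a Simpson index of 0.25.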
|Secondary analysis using EzBioCloud 16S-based MTP app:|
|Once all calculations have been carried out for a single microbiome sample in the EzBioCloud 16S-based MTP pipeline, all the information about that sample is saved as an object named Microbiome Taxonomic Profile (MTP). The EzBioCloud 16S-based MTP app is installed on the Amazon Cloud, and you use the EzBioCloud web-based user interface to run comparative analyses and data mining on sets of MTPs of your own choice. This process is called “secondary analysis”. Typical secondary analyses require only a few mouse clicks, and the results are available in seconds.|
The EzBioCloud team / Last edited on Feb. 19, 2018