MTP-Primary analysis User Guide
Microbiome Taxonomic Profiling (MTP) – Basic Concept
What is Microbiome Taxonomic Profiling (MTP)?
One of the major goals of microbiome analysis is to obtain the taxonomic profile of a sample. The most widely used and cost-effective method is to sequence PCR amplicons of a phylogenetic marker gene. For Bacteria and Archaea, the 16S rRNA gene is generally chosen, whereas the internal transcribed spacer (ITS) region is used for fungal taxonomic profiling. In EzBioCloud, 16S-based Microbiome Taxonomic Profiling is a cloud app that allows users to generate taxonomic profiles from NGS data and easily group and compare the profiles from different samples.
How does this profiling work?
The general procedure of EzBioCloud’s 16S-based MTP app is as follows:
- NGS raw data (in FASTQ or FASTA format) are uploaded to www.ezbiocloud.net. Our MTP pipeline automatically processes your data and converts them into a data unit called an MTP. An MTP represents a single metagenomic or microbiome sample. In addition, data from public sources, including the Human Microbiome Project and the Sequence Read Archive (SRA), have been processed in advance so that they can be grouped and compared with your own MTPs. Each MTP contains run QC information such as read length and the number of reads matched. You also get alpha-diversity statistics along with the taxonomic hierarchy and composition, which can be explored interactively with the visualizations in the EzBioCloud 16S-based MTP app.
- Your own MTPs or those from the public domain database can then be grouped into MTP sets for comparisons. The best way to group samples is to use metadata tags [Learn more].
- Two or more MTP sets are then compared for beta-diversity analytics or biomarker discovery, in what we call secondary analysis. For example, you may look for bacterial species that are differentially present between 30 healthy and 35 obese human subjects. This task would take less than 10 seconds using the EzBioCloud 16S-based MTP app. You can also change the statistical algorithm and parameters, then run the analysis again instantly and interactively.
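As a rough illustration of what a beta-diversity comparison computes, the sketch below calculates the Bray-Curtis dissimilarity between two taxonomic profiles given as read counts per taxon. Bray-Curtis is assumed here only because it is a common choice; this guide does not state which measures the app uses, and the function names and counts are made up for the example.

```python
def relative_abundance(counts):
    """Convert raw read counts per taxon into fractions that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("profile contains no reads")
    return {taxon: n / total for taxon, n in counts.items()}


def bray_curtis(profile_a, profile_b):
    """Bray-Curtis dissimilarity between two profiles (0 = identical, 1 = no shared taxa).

    On relative abundances this reduces to 1 - sum(min(a_i, b_i)).
    """
    a = relative_abundance(profile_a)
    b = relative_abundance(profile_b)
    taxa = set(a) | set(b)
    shared = sum(min(a.get(t, 0.0), b.get(t, 0.0)) for t in taxa)
    return 1.0 - shared


# Hypothetical read counts for two samples (illustrative values only).
sample_1 = {"Bacteroides": 50, "Faecalibacterium": 50}
sample_2 = {"Bacteroides": 25, "Faecalibacterium": 75}
distance = bray_curtis(sample_1, sample_2)
```

Normalizing to relative abundance first keeps the measure independent of sequencing depth, which matters when samples differ widely in read count.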
The EzBioCloud team / Last edited on Feb. 8, 2019
Uploading NGS data into the EzBioCloud 16S-based MTP app
If you are an EzBioCloud user, you can upload NGS data to the Microbiome Taxonomic Profiling (MTP) pipeline. This is the only way to add your own data to the EzBioCloud 16S-based MTP app. You do not need to be a paid user to get started: if you are an Academic/Non-profit user, you can upload up to 100 microbiome samples with up to 100,000 NGS reads. [Learn more].
MTP Upload Center
- Choose the NGS platform used to generate the data.
- Choose a taxonomy database. At present, there are three databases to choose from.
- PKSSU4.0: This is the up-to-date version of the prokaryotic 16S database (Bacteria+Archaea).
- mtpdb_v1.5: This is a previous version of the prokaryotic 16S database, provided for compatibility with old data. If you are new to our service, please choose PKSSU4.0.
- ATCCSTD1.0: This is a 16S database for analyzing the ATCC® Microbiome Standards.
- Choose a target taxon.
- The PCR primer sequences in the NGS data may not be accurate because they originate from primer annealing rather than from the template. We therefore recommend excluding or trimming them. If you want the EzBioCloud 16S-based MTP app to trim them, select the primer set from the presets, or check ‘Custom’ and enter your own primer sequences. Check ‘None’ if you do not want primers trimmed or have already trimmed them.
- Upload NGS files here. You can upload files directly or via links from services such as Google Drive, Dropbox, and pCloud. For paired-end data, both FASTQ files should be uploaded. For PacBio CCS data, please upload the processed FASTA data; at present, we do not accept PacBio raw data. If you upload FASTQ file(s), we will use the quality scores included in the FASTQ format. Files can be compressed with gzip (.gz extension) to speed up the upload.
- Shows the status of your pipeline runs. Please be patient if you are using the free version, since processing might take some time. The typical run time for an MTP for paid users is less than 30 min.
Make sure that you upload 16S amplicon data generated for Bacteria and Archaea.
Here are some primer sets that are most widely used for microbiome analysis:
- Illumina paired-end sequencing
- V3V4 (ChunLab): 341F = CCTACGGGNGGCWGCAG | 805R = GACTACHVGGGTATCTAATCC
- V4 (Earth Microbiome Project): 515FB = GTGYCAGCMGCCGCGGTAA | 806RB = GGACTACNVGGGTWTCTAAT
- PacBio full-length sequencing
- 27F = AGRGTTTGATYMTGGCTCAG | 1492R = GGYTACCTTGTTACGACTT
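The primers listed above contain IUPAC degenerate codes (N, W, H, V, Y, and so on). The sketch below shows one way such primers could be matched and trimmed from a merged read by expanding each code into a regular-expression character class. It is not the app's in-house trimming code: it requires exact matches, tolerates no sequencing errors, and its function names are hypothetical.

```python
import re

# IUPAC nucleotide codes expanded to regex fragments.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def primer_to_regex(primer):
    """Turn a degenerate primer into a regex pattern string."""
    return "".join(IUPAC[base] for base in primer.upper())


def reverse_complement(seq):
    """Reverse-complement a sequence, including IUPAC degenerate codes."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def trim_primers(read, forward, reverse):
    """Strip the forward primer from the 5' end and the reverse primer
    (as its reverse complement) from the 3' end of a merged read."""
    read = read.upper()
    fwd = re.match(primer_to_regex(forward), read)
    if fwd:
        read = read[fwd.end():]
    rev = re.search(primer_to_regex(reverse_complement(reverse)) + "$", read)
    if rev:
        read = read[:rev.start()]
    return read


F341 = "CCTACGGGNGGCWGCAG"
R805 = "GACTACHVGGGTATCTAATCC"
# Synthetic read: one concrete instance of each primer around a short insert.
demo_read = "CCTACGGGAGGCAGCAG" + "ACGTACGT" + reverse_complement("GACTACAAGGGTATCTAATCC")
trimmed = trim_primers(demo_read, F341, R805)
```

A read without recognizable primers is returned unchanged, which corresponds to data whose primers were already removed.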
The EzBioCloud team / Last edited on Sept. 19, 2019
Microbiome Taxonomic Profiling Pipeline used in EzBioCloud
The primary objective of analyzing 16S amplicon sequencing data is to profile sequencing reads into a known taxonomic structure. The figure below illustrates the current bioinformatics pipeline used in the EzBioCloud 16S-based MTP app.
Raw data that is uploaded to the pipeline can be any one of the following:
- Single-end reads generated by Roche 454
- Single or paired-end reads generated by Illumina platforms (MiSeq, HiSeq etc.)
- CCS (circular consensus sequencing) reads generated by Pacific Biosciences (PacBio) platforms. Please consult PacBio’s manual for how to generate CCS reads.
- Any other NGS platforms that can generate FASTQ or FASTA format outputs.
|Merging paired-end reads:|
|In the case of paired-end sequencing (typically MiSeq 250 bp x 2), two sequences representing each end of the same PCR amplicon are merged using the overlapping sequence information. For single-end or CCS sequencing, this step is not required. Reads that cannot be merged are omitted from the subsequent steps. Where merging applies, the VSEARCH program (Rognes et al. 2016) is used.|
|When PCR amplicons are sequenced, the primers used for PCR are not considered “sequenced.” These primer regions result from the annealing process rather than from direct sequencing of the template. Therefore, our pipeline removes the primer sequences that were used for PCR of the 16S gene. In-house code is used for this step.|
|Filtering by quality:|
|Even though present-day NGS machines produce high-quality sequences, low-quality sequences can also be generated. We apply several measures to detect and filter out low-quality sequences [Learn more].|
|Extracting non-redundant reads:|
|The identical sequences are de-replicated in this step to reduce computational time.|
|Dereplicated sequences are then subjected to taxonomic assignment. We use the VSEARCH program (Rognes et al. 2016) to search the query NGS reads against the EzBioCloud 16S database and calculate sequence similarities. A 16S similarity of 97% is used as the cutoff for species-level identification. Other sequence similarity cutoffs are used for genus and higher taxonomic ranks.|
To reduce computation and improve accuracy, we built different versions of the reference 16S database that match various regions of the 16S sequence. For example, the full-length version (V1-V9) is used for PacBio CCS data, whereas the V3-V4 version is used for MiSeq 250 bp paired-end sequencing data.
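The dereplication and species-level cutoff steps described above can be sketched as follows. This is a simplified illustration, not the pipeline's code: the similarity search itself (done by VSEARCH in the pipeline) is represented by a precomputed `best_hits` dictionary, and all names and values are hypothetical.

```python
from collections import Counter

SPECIES_CUTOFF = 97.0  # percent 16S similarity for species-level identification


def dereplicate(reads):
    """Collapse identical reads; return (sequence, count) pairs, most abundant first."""
    counts = Counter(read.upper() for read in reads)
    return counts.most_common()


def assign_species(unique_reads, best_hits):
    """Split dereplicated reads into species-assigned counts and unassigned reads.

    best_hits maps a sequence to (species_name, percent_similarity), standing in
    for the output of a database search tool.
    """
    assigned = Counter()
    unassigned = []
    for seq, count in unique_reads:
        hit = best_hits.get(seq)
        if hit is not None and hit[1] >= SPECIES_CUTOFF:
            assigned[hit[0]] += count
        else:
            unassigned.append((seq, count))
    return assigned, unassigned


reads = ["ACGT", "ACGT", "acgt", "TTTT"]
unique = dereplicate(reads)
hits = {"ACGT": ("Species A", 99.1), "TTTT": ("Species B", 95.0)}  # made-up hits
assigned, unassigned = assign_species(unique, hits)
```

Searching each unique sequence once and multiplying by its count gives the same profile as searching every read, which is where the saving in computational time comes from.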
|We assume that NGS sequencing reads that match the reference sequences in the EzBioCloud database are not chimeric. Only the remaining reads are checked for chimeras using the UCHIME program [Learn more].|
|OTU (operational taxonomic unit) is a widely used term in microbiome research and can be regarded as a “species” [Learn more]. All sequences from a sample can be clustered into many OTUs using different algorithms and software tools. Rideout et al. (2014) evaluated three algorithms (de novo, closed-reference and open-reference). The EzBioCloud 16S-based MTP pipeline adopted the “open-reference” method with the following four steps:|
|Estimating alpha diversity indices:|
|Using OTU information (number of OTUs and sequences in each OTU), various alpha diversity indices can be calculated. These include species richness, Shannon and Simpson diversity indices, and many more.|
|Secondary analysis using EzBioCloud 16S-based MTP app:|
|Once all calculations have been carried out for a single microbiome sample in the EzBioCloud 16S-based MTP pipeline, all the information about that sample is saved as an object named Microbiome Taxonomic Profile (MTP). The EzBioCloud 16S-based MTP app is installed on the Amazon Cloud, and you use the EzBioCloud web-based user interface to run comparative analysis and data mining on sets of MTPs of your own choice. This process is called “secondary analysis”. Typical secondary analyses require only a few mouse clicks, and you have the results in seconds.|
The EzBioCloud team / Last edited on Feb. 19, 2018
Quality-filtering for 16S Microbiome Taxonomic Profiling
For 16S microbiome taxonomic profiling, the following criteria are used to filter out sequencing reads with low quality:
- Sequences with lengths of <100 bp or >2,000 bp.
- Sequences with an average Q value of <25.
- Sequences not predicted to be a 16S gene by the Hidden Markov Model (HMM) based search.
- Sequences are first assigned to the reference 16S database. All sequences that do not match any reference sequence with at least 97% similarity are clustered by the UCLUST method using a 97% cutoff. If a sequence is found to be a singleton, we assume that it is erroneous and exclude it from the subsequent analyses. This approach is widely used, especially for Illumina short-read sequencing [See QIIME manual’s step 5].
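The length, average-quality, and singleton criteria above can be expressed as a minimal sketch. It assumes Phred+33 encoded FASTQ quality strings and a plain arithmetic mean of the per-base Q values; the guide does not specify either, and the HMM-based 16S check is omitted. Function names are hypothetical.

```python
def mean_phred(quality, offset=33):
    """Arithmetic mean of per-base Phred scores from a FASTQ quality string."""
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def passes_length_and_quality(seq, quality, min_len=100, max_len=2000, min_q=25):
    """Apply the length (<100 bp or >2,000 bp) and average Q (<25) filters."""
    if len(seq) != len(quality):
        raise ValueError("sequence and quality string differ in length")
    if not (min_len <= len(seq) <= max_len):
        return False
    return mean_phred(quality) >= min_q


def drop_singletons(cluster_sizes):
    """Discard de novo clusters that contain a single read."""
    return {cid: n for cid, n in cluster_sizes.items() if n > 1}


good = passes_length_and_quality("A" * 150, "I" * 150)       # 'I' encodes Q40
too_short = passes_length_and_quality("A" * 99, "I" * 99)
low_quality = passes_length_and_quality("A" * 150, "5" * 150)  # '5' encodes Q20
kept = drop_singletons({"cluster_1": 1, "cluster_2": 5})
```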
The EzBioCloud team / Last edited on Feb. 19, 2018
What is the chimera?
According to Greek mythology, the Chimera is a monstrous fire-breathing hybrid creature of Lycia in Asia Minor, composed of parts from more than one animal. Here, we define a chimera as an artifactual PCR product/amplicon generated erroneously from more than one DNA template. Chimeras are known to be inevitable when preparing amplicon sequencing libraries for NGS. It is therefore important to detect and filter them out before any type of microbiome analysis.
Mechanism of chimera formation
PCR involves multiple cycles of (i) denaturation of DNA templates by heat to generate single-stranded DNA templates, (ii) annealing of the primers to each of the DNA templates and (iii) extension/elongation by DNA polymerase. The major cause of chimera formation is an aborted extension product from an earlier PCR cycle that acts as a primer in a subsequent cycle. If this aborted extension product anneals to and primes DNA synthesis from an incomplete template, a chimeric PCR product is formed (see the figure below).
The proportion of chimeras in PCR reactions varies depending on the DNA polymerase used, the PCR conditions, and the product size and diversity of the DNA templates. Haas et al. (2011) reported that 15-20% chimeras were detected for 454 sequencing of 16S.
How to detect the chimeras
There are two major approaches to detecting chimeras in NGS-based amplicon data.
(1) Reference-dependent detection: As shown in the above example, the two ends of the PCR product match strains A and B, respectively. However, the sequence as a whole would not match either strain A or strain B with high similarity. If we know the exact sequences of strains A and B, and there are substantial differences between the two, we should be able to determine that this chimeric product did not come from a single strain but from both strains. Using this principle, a large number of NGS reads can be screened for chimeras against a well-established, trusted, non-chimeric reference database. The quality of the chimera-free reference database is the key to success in this case. UCHIME and ChimeraSlayer provide this algorithm.
(2) De novo detection: In this algorithm, a chimera-free reference database is automatically generated for each NGS dataset. Initially, the reference database is empty. Then, NGS reads are considered in order of decreasing abundance. If a sequence is classified as chimeric, it is discarded; otherwise, it is added to the reference database (so the size of the reference database grows). Candidate parents (PCR templates, strains A and B in the previous figure) are required to be more abundant than the query sequence, on the assumption that a chimera has undergone fewer rounds of amplification and will, therefore, be less abundant than its parents (Edgar et al., 2011). UCHIME provides this algorithm.
Because there is a huge amount of full-length 16S sequences available, reference-dependent detection has been mostly used in recent studies, particularly for the human microbiome. UCHIME, as implemented in QIIME and MOTHUR packages, is most widely used and has been cited many times.
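The two approaches above can be illustrated with a deliberately simplified model. The sketch below is not UCHIME: it assumes equal-length, pre-aligned sequences and flags a query only when it is an exact join of the left part of one parent and the right part of another, whereas real tools score imperfect matches. All names are hypothetical.

```python
def find_parents(query, references):
    """Reference-dependent check: return (left_parent, right_parent, breakpoint)
    if the query is an exact two-parent join, otherwise None."""
    if query in references:
        return None  # an exact match to a trusted reference is not chimeric
    same_length = [r for r in references if len(r) == len(query)]
    for bp in range(1, len(query)):
        left = [r for r in same_length if r[:bp] == query[:bp]]
        right = [r for r in same_length if r[bp:] == query[bp:]]
        for l in left:
            for r in right:
                if l != r:
                    return (l, r, bp)
    return None


def denovo_chimera_filter(abundances):
    """De novo check: visit sequences by decreasing abundance, growing the
    reference set with every sequence that is not classified as chimeric."""
    reference, chimeras = [], []
    for seq, count in sorted(abundances.items(), key=lambda kv: -kv[1]):
        # Candidate parents must be more abundant than the query.
        parents = [r for r in reference if abundances[r] > count]
        if find_parents(seq, parents) is not None:
            chimeras.append(seq)
        else:
            reference.append(seq)
    return reference, chimeras


strain_a = "AAAAAAAAAA"
strain_b = "CCCCCCCCCC"
hybrid = "AAAAACCCCC"
reference, chimeras = denovo_chimera_filter({strain_a: 100, strain_b: 80, hybrid: 5})
```

Because this toy version demands exact joins, it would also flag a read that differs from one parent only at its last base; real detectors weigh how much better the two-parent model fits than any single parent.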
Example of chimera
The following sequence from a human skin microbiome sample was generated by a Roche 454 instrument.
This sequence is identified as a chimera by the UCHIME algorithm because:
- The left part of the sequence matches Staphylococcus epidermidis at 99.7% (Firmicutes;Bacilli;Bacillales;Staphylococcaceae;Staphylococcus)
- The right part of the sequence matches Propionibacterium acnes at 100% (Actinobacteria;Actinobacteria;Propionibacteriales;Propionibacteriaceae;Propionibacterium)
Please note that the two parents belong to different phyla (Firmicutes and Actinobacteria). You can try this yourself using EzBioCloud’s [Identify] service at //www.ezbiocloud.net/identify; copy the left and right halves of the above sequence and use them to identify the left and right parts of this 454 sequence.
Chimera detection in EzBioCloud 16S-based MTP app
The EzBioCloud 16S-based MTP app uses UCHIME together with a manually curated chimera-free reference database. EzBioCloud’s chimera-free reference database contains:
- Sequences from pure cultures.
- Full-length sequences of uncultured organisms that are confirmed to be genuine. This includes >2,000 sequences generated by PacBio CCS technology that were recovered in repeated PCR reactions of the same sample or of different samples. Please note that chimera formation is thought to be random, so repeated recovery of the same sequence from different PCR reactions is a fair indication of a non-chimeric read. The quality of sequences in the chimera-free reference database was further checked manually using secondary structure modeling of 16S rRNA molecules.
The following figure illustrates the workflow for chimera detection in the EzBioCloud 16S-based MTP pipeline. The query NGS sequence data are first subjected to taxonomic assignment against the EzBioCloud 16S database. If a sequence matches a reference sequence with >97% similarity, it is assigned to a species and is not labeled as a chimera, because the EzBioCloud 16S database itself has passed a rigorous quality control process that includes chimera detection. The remaining query NGS reads are checked by the UCHIME tool. Because our chimera-free reference database has high coverage of human and mouse microbiomes, we believe that few chimeras escape this process, particularly for human/mouse microbiome samples.
Notes on chimera detection
- Chimera detection is a very important step in microbiome analysis because unchecked chimeras will be reported as novel species. Together with erroneous sequences, chimeras falsely increase the number of species/OTUs detected. Consequently, alpha-diversity indices will be overestimated.
- There is no way to detect all chimeras. However, the efficiency of the chimera removal process can be greatly improved by the quality and coverage of a chimera-free reference database.
The EzBioCloud team / Last edited on Feb. 3, 2019
Browsing a single MTP
Here, we will go through how to browse and get the necessary information about a single Microbiome Taxonomic Profile (MTP). We will use example data which was generated using a whole run of Illumina MiSeq 250 bp paired-end sequencing. The sample is from Dr. Jon Jongsik Chun who is the founder and CEO of ChunLab, Inc. This data was analyzed using the EzBioCloud 16S-based Microbiome Taxonomic Profiling (MTP) app and can be browsed freely, without a login at [Explore a deep-sequenced human fecal sample].
This web-page consists of several tabs each with different aspects of a single MTP sample.
[About MTP] – tab
- This is your MTP/sample name. It is entered when you upload the NGS raw data to the EzBioCloud 16S-based MTP app and can be edited later.
- This is a memo field where you can store comments.
- Sample information is organized by metadata tags in EzBioCloud 16S-based MTP app. Apply tags according to your specific needs. Tags can be a great tool when grouping samples/MTPs into sets for subsequent comparisons and secondary analysis.
- Target taxon is either [Bacteria] or [Archaea]. It is chosen when the data are uploaded to the cloud.
- The EzBioCloud 16S-based MTP pipeline can be run using different versions of the reference database. PKSSU4.0 stands for EzBioCloud’s prokaryotic small subunit (16S) rRNA database 4.0, which was released in March 2018.
- Region can be one of [V1V9], [V4], [V3V4], [V1V3], [V3V5], [V6V9], and [SHOTGUN]. It is determined after the data are analyzed in the cloud.
- To edit an MTP name and/or the memo field, click the [Edit] button. This function is only supported in My data.
- To download the taxonomic profile of this MTP, click the [Profile] button. A profile file contains full information about read-counts for each taxon (from phyla to species). We also provide the copy-number-corrected counts [Learn more]. This function is only supported in My data.
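Copy-number correction is commonly done by dividing each taxon's read count by the number of 16S rRNA gene copies in its genome, so that taxa with many copies are not over-represented. The sketch below shows that general idea only; the exact procedure used for the downloadable profile is not described in this guide, and the taxon names and copy numbers are invented for the example.

```python
def copy_number_corrected(read_counts, copy_numbers, default_copies=1.0):
    """Divide each taxon's read count by its 16S copy number."""
    corrected = {}
    for taxon, reads in read_counts.items():
        copies = copy_numbers.get(taxon, default_copies)
        if copies <= 0:
            raise ValueError(f"invalid copy number for {taxon}")
        corrected[taxon] = reads / copies
    return corrected


def to_percent(counts):
    """Express counts as percentages of their total."""
    total = sum(counts.values())
    return {taxon: 100.0 * n / total for taxon, n in counts.items()}


# Hypothetical read counts and 16S copy numbers (illustrative values only).
raw_counts = {"Taxon A": 700, "Taxon B": 200}
copies_per_genome = {"Taxon A": 7, "Taxon B": 2}
corrected = copy_number_corrected(raw_counts, copies_per_genome)
corrected_percent = to_percent(corrected)
```

In this made-up case the raw reads suggest a 78% to 22% split, while the corrected values indicate equal numbers of cells.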
About read counts
- This indicates the number of sequencing reads from the uploaded raw NGS data, minus the reads that do not overlap in paired-end sequencing.
- Several quality measures are applied to filter out low-quality [Learn more], non-target, and chimeric reads [Learn more]. Only the remaining sequencing reads, called “Valid reads” are used for the subsequent microbiome analyses.
- Removed sequences are further classified here. You may be surprised that >2.5M reads were detected as chimeric amplicons. This high level of chimera detection is likely because our non-chimeric reference database covers this particular sample well. In the latest version, we added >2,000 full-length high-quality 16S sequences derived from human and mouse gut microbiomes.
About read lengths
In this section, statistics (min, max, and average) about the lengths of valid reads are given.
About taxonomic assignment
- This is the percent of quality-controlled sequencing reads that were assigned at the species level. In this case, >5.9 million reads were assigned at the species level, or 92.8%. Using the EzBioCloud 16S database, the taxonomic coverage of human microbiome samples ranges from 95 to 98% [Learn more].
- This is the number of species that were actually detected in this MTP. Because this sample was sequenced very deeply, we found a relatively high number of species. If the same sample were sequenced to a depth of around 10,000 reads, this number would be less than 500.
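The dependence of detected species on sequencing depth can be explored by randomly subsampling reads, which is also the idea behind a rarefaction curve. The following is a minimal sketch with made-up counts and a hypothetical function name; it is not how the app computes its plots.

```python
import random


def species_at_depth(species_counts, depth, seed=0):
    """Number of distinct species seen when `depth` reads are drawn at random
    (without replacement) from a profile of read counts per species."""
    pool = [name for name, n in species_counts.items() for _ in range(n)]
    if depth > len(pool):
        raise ValueError("depth exceeds the number of reads in the profile")
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    return len(set(rng.sample(pool, depth)))


# Illustrative profile: two dominant species and two rare ones.
profile = {"sp_a": 1000, "sp_b": 1000, "sp_c": 1, "sp_d": 1}
shallow = species_at_depth(profile, 10)
full = species_at_depth(profile, sum(profile.values()))
```

Rare species are the first to disappear at shallow depth, which is why species counts are only comparable between samples sequenced or subsampled to similar depths.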
[Alpha-Diversity] – tab
- OTU-picking method used. CL_OPEN_REF_UCLUST_MC2 is an open-reference method in which de novo clustering is carried out using the UCLUST program and single-membered de novo clusters are ignored [Learn more].
- Cutoff used for taxonomic assignment at the species level and for de novo clustering.
- The number of OTUs found = the number of species plus the number of OTUs from de novo clustering. Because 92.8% of reads were assigned to 840 species, most of the remaining 7.2% of reads constitute >28,000 OTUs. Even though many single-membered OTUs were discarded, the number of OTUs is probably overestimated because of sequencing errors. The presence of two or more identical reads in such deep-sequenced data does not guarantee that the sequence is real.
- Extreme deep-sequencing with 64M reads captured almost all of the species diversity in this fecal sample, resulting in 100% Good’s coverage of library.
Various alpha-diversity indices can be used to explain biodiversity of an MTP:
- Species richness indices (ACE, Chao1, Jackknife) try to estimate the number of species/OTUs in an MTP.
- Diversity indices (Shannon, Simpson, NPShannon, Phylogenetic Diversity) are mathematical measurements of species diversity or evenness in an MTP. LCI, low confidence interval; HCI, high confidence interval.
Also, a rarefaction curve and rank abundance plot are provided.
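To make the indices above concrete, the sketch below computes a few of them from a list of per-OTU read counts. The specific variants are assumptions, since the guide does not state which formulas the app uses: Shannon with the natural logarithm, Simpson as the sum of squared proportions, the bias-corrected form of Chao1, and Good's coverage as one minus the fraction of reads in singleton OTUs.

```python
import math


def shannon_index(otu_counts):
    """Shannon diversity H' = -sum(p_i * ln p_i)."""
    total = sum(otu_counts)
    return -sum((c / total) * math.log(c / total) for c in otu_counts if c > 0)


def simpson_index(otu_counts):
    """Simpson index D = sum(p_i^2); some tools report 1 - D instead."""
    total = sum(otu_counts)
    if total == 0:
        raise ValueError("no reads")
    return sum((c / total) ** 2 for c in otu_counts)


def chao1(otu_counts):
    """Bias-corrected Chao1 richness: S_obs + F1*(F1-1) / (2*(F2+1))."""
    observed = sum(1 for c in otu_counts if c > 0)
    singletons = sum(1 for c in otu_counts if c == 1)
    doubletons = sum(1 for c in otu_counts if c == 2)
    return observed + singletons * (singletons - 1) / (2 * (doubletons + 1))


def goods_coverage(otu_counts):
    """Good's coverage = 1 - (number of singleton OTUs / total reads)."""
    total = sum(otu_counts)
    if total == 0:
        raise ValueError("no reads")
    singletons = sum(1 for c in otu_counts if c == 1)
    return 1.0 - singletons / total


even_sample = [10, 10, 10, 10]   # four equally abundant OTUs
skewed_sample = [5, 1, 1, 2]     # two singletons, one doubleton
```

When no singleton OTUs remain, Good's coverage evaluates to 100% regardless of depth, so it should be read together with the richness estimators rather than on its own.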
[Taxonomic hierarchy] – tab
On this tab, you can do the following:
- Browse the taxonomic structure of an MTP in a hierarchical manner
- Download sequences (in FASTA format) or copy individual sequences to the clipboard
Let’s explore data to see what species of the genus Faecalibacterium are present in this MTP. Faecalibacterium is one of the most important human gut taxa and is thought to be beneficial, as it produces short chain fatty acids from dietary fibers.
Click here to view its taxonomic hierarchy.
In [Taxonomic hierarchy] tab, select and expand Firmicutes → Clostridia → Clostridiales → Ruminococcaceae → Faecalibacterium.
- Click “Faecalibacterium prausnitzii group” to reveal the other species and phylotypes that are included in this taxonomic group. Species and phylotypes that are indistinguishable from each other are classified into taxonomic groups in the EzBioCloud 16S-based MTP app [Learn more]. The number shown beside each taxon indicates the number of sequencing reads assigned to that taxon.
- Species/phylotypes included in the “Faecalibacterium prausnitzii group” are listed here. Click the name to open the webpage with its taxonomic information. Note that FP929045_s, NMTZ_s, GG697149_s and GL538271_s are phylotypes that were supported by genomic evidence [Learn more].
- The second most abundant species, PAC001430_s, is represented by a full-length PacBio sequence.
- To download all sequences in the selected taxa, click here. This function is not supported in the guest mode.
- To expand all taxa at once, select the taxonomic rank and click [Expand].
- Click “Faecalibacterium prausnitzii group” to view its taxonomic hierarchy.
Formally, the genus Faecalibacterium has only one species with a valid name, Faecalibacterium prausnitzii. Here, the combination of a taxonomically validated EzBioCloud database and a sensitive taxonomic assignment algorithm allows detailed profiling at the species level. Because many identified phylotypes are represented by whole genome sequences, further in-depth functional investigation is possible using comparative genomics.
[Taxonomic composition] – tab
On this tab, taxonomic compositions at various ranks (from phylum to species) are given as pie charts and tables. The charts and tables can be exported or downloaded for immediate use in reports and publications.
In the species composition table, use the <filter> to quickly search for the abundances of taxa that you are interested in. For example, entering “Bacteroides” will show you all species whose names include the term “Bacteroides” (see the screenshot below). Please note that this will not reveal the phylotypes in the genus Bacteroides.
[Selected taxa] – tab
In this tab, you can explore the abundances of any taxa. We also provide several predefined taxa (subject to change):
- Lactic acid bacteria (LAB): This term is not a taxonomic one, as it refers to any bacteria that are capable of producing lactic acid. Traditionally, LAB is used for the probiotic strains that are now classified in the genera Lactobacillus, Leuconostoc, Lactococcus, Weissella, and Bifidobacterium.
- Firmicutes to Bacteroidetes ratio: Firmicutes and Bacteroidetes are the two major phyla in the human gut microbiota. The Firmicutes to Bacteroidetes ratio (F/B) has been used as a biomarker of a person's state of health. The F/B ratio has been shown to be correlated with obesity in many studies.
- Human gut taxa: Several taxa at various ranks are predefined for human microbiome study.
- Select [Human gut taxa]
- [F] indicates the taxonomic rank, “family”, of Ruminococcaceae. Similarly, [P] for phylum, [C] for class, [O] for order, [G] for genus, [S] for species.
- The bar indicates the abundance (18.56%).
The abundance of any taxa which are not included in these predefined lists can be found by entering the name of a taxon into [Search taxa]. For example, enter “Escherichia coli group” to get the abundance of this taxonomic group.
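The abundance and Firmicutes to Bacteroidetes ratio described in this section are simple to compute from phylum-level read counts, as the following sketch shows. The function names and counts are hypothetical and are not taken from the app.

```python
def taxon_abundance(counts, taxon):
    """Relative abundance of one taxon, as a percentage of all reads."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("profile contains no reads")
    return 100.0 * counts.get(taxon, 0) / total


def firmicutes_bacteroidetes_ratio(phylum_counts):
    """F/B ratio from phylum-level read counts."""
    firmicutes = phylum_counts.get("Firmicutes", 0)
    bacteroidetes = phylum_counts.get("Bacteroidetes", 0)
    if bacteroidetes == 0:
        raise ValueError("no Bacteroidetes reads; the ratio is undefined")
    return firmicutes / bacteroidetes


# Illustrative phylum-level counts for one sample.
phyla = {"Firmicutes": 600, "Bacteroidetes": 300, "Actinobacteria": 100}
fb_ratio = firmicutes_bacteroidetes_ratio(phyla)
firmicutes_percent = taxon_abundance(phyla, "Firmicutes")
```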
[Krona] – tab
Under the “Krona” tab, taxonomic compositional data are loaded into the Krona tool, an open-source visualization project. This tool was developed by Ondov et al. (2011) and provides an interactive means of exploring the data. Figures can be captured here for publication or presentation purposes.
[Word Cloud] – tab
The Word Cloud is a quick and easy way of visualizing the major taxonomic groups at any rank (from phylum to species) for a single MTP. A word cloud is an image composed of words used in a particular text or subject, in which the size of each word indicates its frequency or importance. The Word Cloud images below are of phyla and genera from a human gut microbiome. Note that the genus Faecalibacterium belongs to the phylum Firmicutes.
[MTP 2.0] – tab
- Select the type of the chart, pie chart or stacked bar.
- Select the taxonomic rank of the inner circle.
- Select the taxonomic rank of the outer circle.
- Display internal rank(s) between the inner circle and the outer circle.
- Set the default angle at which the graph begins.
- Sort the items of the graph in counterclockwise (CCW) order.
- (Score) All reads that are present in minor quantities are classified as ETC. You can change the cutoff for the ETC reads here. (Index) Only the entered number of taxa is shown; the rest are classified as ETC. (None) Show all taxa.
- You can search for a taxon of interest.
- Information on the current taxonomic hierarchy is shown.
- Click or mouseover to get information about the taxon.
- Display the taxonomic composition of this MTP sample. You can sort by the taxonomic name or taxonomic composition size.
- You can select the table to sort by row or column.
The EzBioCloud team / Last edited on Feb 8, 2019