This tutorial was prepared using a well-known pathogenic species, Vibrio cholerae, (the causal agent of cholera). It is a step-by-step example showing how to use the comparative genomics functions in EzBioCloud
About Vibrio cholerae
Vibrio cholerae is a bacterial species that causes a disease called cholera in humans. Cholera is an acute diarrheal disease that can kill within hours if left untreated (See WHO site for more details). However, not all strains of V. cholerae cause cholera. The disease is mainly caused by strains that contain the cholera toxin (CTX), which is coded in two cholera toxin genes, ctxA and ctxB. These strains which contain the ctxAB genes are considered “toxigenic”. Interestingly, these deadly genes reside in a virus called ctx bacteriophage (CTXφ), which can be transferred from strain to strain via a major mechanism of lateral gene transfer called temperate bacteriophage-mediated gene transfer.
Cholera is so widespread throughout Asia and Africa that it is known as a “cholera pandemic”. The world is still in the midst of a cholera pandemic, and the current 7th cholera pandemic is caused by a group of clonal strains known as serotype O1 biovar El Tor (or O1 El Tor for short). The previous pandemic (6th) was caused by serotype O1 biovar Classical (or O1 Classical). The reason for this clonal shift (from O1 Classical to O1 El Tor) is not understood very well, but it is interesting to note that O1 serotype/Classical biotype strains are no longer found in any part of the world, even though they were isolated all over the world during the 6th cholera pandemic.
Before getting started
- Before starting this tutorial, we suggest installing a desktop computer program for manipulating dendrograms and phylogenetic trees as this will help improve the quality of the presentation. We recommend MEGA program for both MS Windows and Mac OS X.
- You should have an account at ezbiocloud.net.
- Go to [Apps], then choose [CG, Comparative Genomics]
- Click on [Set] at the Tutorial on the “Vibrio cholerae”. It is at the left side of the page.
From here, we will explain the major features of EzBioCloud’s comparative genomics service using this data set.
About V. cholerae strains
|N16961||O1 El Tor||7th Pandemic|
|TSY216||O1 El Tor||7th Pandemic|
|IEC224||O1 El Tor||7th Pandemic|
|2010EL-1786||O1 El Tor||7th Pandemic|
|B33||O1 El Tor||7th Pandemic|
|O395||O1 Classical||6th Pandemic|
|ATCC 14035 (Type strain of the species)||O1 Classical||6th Pandemic|
|TM 11079-80||O1 El Tor|
|LMA3984-4||O1 El Tor|
|12129(1)||O1 El Tor|
Quickly browsing data set
This data set contains 16 genomes of V. cholerae and they are a combination of strains with different serotypes, biotypes, and toxigenicity from clinical and environmental sources.
- Go to [DATE SET]→[Genomes in This Data Set] to browse information on strains (=metadata). We try to collect and validate as much metadata as possible from public databases and literature.
Exploring individual genomes
To explore individual genome data, you need to open EzBioCloud’s “Genome Explorer”. Here, you can access the genome of strain MO10 by clicking the [Open] button ((a) in the screenshot below). This strain is a Vibrio cholerae isolate belonging to serotype O139, isolated in Madras, India in 1992. The genome sequence was not completely determined and consists of 27 contigs. If you want to know how to explore a single genome, please proceed to the tutorial for the Genome Explorer.
The first step in the comparative genomic analysis is to look at the phylogenetic relationship among strains in detail so that we know which strains are more closely related than others.
- Go to [PHYLOGENOMICS]→[OrthoANI]
OrthoANI is a similarity measurement between two genome sequences. Hierarchical clustering (usually UPGMA) based on OrthoANI values is one of the best ways to infer phylogenetic relationships among multiple genomes/strains.
*Please note that more information was added to the figure below using the MEGA program. The 6th and 7th pandemic strains are those which killed millions of people over the last several decades.
- Go to [PHYLOGENOMICS]→[Tetra-nucleotide]
Tetra-nucleotide analysis (TNA) utilizes frequencies of tetra-nucleotides in the genome sequences, thus it’s not exactly a phylogenetic method nor an actual sequence alignment. However, it can provide additional valuable information on how genomes are related. The figure represents the UPGMA clustering of V. cholerae strains based on tetra-nucleotides compositions.
- This strain is quite different from the others in the composition of tetra-nucleotides. We will find out why later on.
What makes pandemic strains so successful in conquering the world?
This is probably the first question we should consider. We can assume that only pandemic strains contain genes that are responsible for enhanced infectivity and/or toxigenicity. Therefore, they must have won the competition over other V. cholerae strains in nature thanks to those genes. Let’s find out together if this is true!
- Open [PAN-GENOME]→[Identify Differentially Present POGs/Pathways]
- Select all pandemic strains from 6th and 7th Pandemic from the center panel and move them to the left panel (Group #1). Move the remaining strains to the right panel (Group #2) and click [Fisher’s exact test]. You can use this statistical method when you have a sufficient number of genomes to consider. If you have only a few genomes to compare, use [Exclusive OR (XOR)] instead, which is a simpler algorithm to find differentially present POGs.
In EzBioCloud, all protein-coding genes (CDSs) are compared or compiled to generate a “pan-genome”, which consists of a non-redundant set of orthologous CDSs called pan-genome orthologous groups (POGs).
Here, two outputs are provided:
- Differentially present orthologous groups: Genes that are differentially present between two groups of genomes.
- Pathway enrichment: Once such orthologous groups are identified, we can further proceed to look for differentially present KEGG metabolic pathways.
- Representative CDS name of POG
- p-value from the Fisher’s exact test. The smaller the value, the more significant.
- Blue color means that this gene is present in the respective genome.
- Red color means that this gene is absent in the respective genome.
Using the p-value of Fisher’s exact test, you can find and browse genes with the most differences. Doing this, you can see that there are many genes that are only present in the pandemic strains. Some notable DPGs are ctxA, ctxB (genes coding for cholera toxin) and tcpA (genes coding for pillin which is co-regulated with toxin). These genes have shown to be of premium importance in cholera pathogenicity. It looks like our analysis found the right answer to the previous question of “What makes pandemic strains?”
Our next question to consider is whether there are any differences in the larger scale functional aspects (e.g. metabolic pathways) between pandemic and non-pandemic groups. You can find the answer by opening the [Pathway Enrichment] tab.
Again, the p-value is the key. The four most significant pathways (of KEGG database) under a p-value of 0.05 are:
- Vibrio cholerae pathogenic cycle (related to disease)
- Vibrio cholerae infection (related to disease)
- Lipopolysaccharide biosynthesis (related to serotype and antigenicity/immunity)
- Pyruvate metabolism (related to metabolism)
These analyses make sense except for “Pyruvate metabolism”, which requires further investigation.
How many genes are shared among strains?
It is well known that two strains belonging to the same species do not completely share their genes. In other words, some genes are shared by all strains whereas others are not. There are also genes that are only present in one strain, and not in the others. This aspect of genome data can be explored via [PAN-GENOME]→[Venn Diagram].
- Select 4 genomes (N16961, TSY216, IEC224, 2010EL-1786) and click on [Draw]. As you can see from the Venn Diagram, these strains are closely related in genome sequences (by high OrthoANI values). If you want to check these values, go to [PHYLOGENOMICS]→[OrthoANI] and download the CSV file containing pairwise ANI values.
- Please note that strain TSY216 has many extra genes!
3,412 genes (CDSs) are shared among the four strains and only ~100 genes are solely present in any single genome (called singletons). Surprisingly, TSY216 has 1,034 extra genes that are not found in other strains. Because these strains are very similar in their genome sequences, we have good reason to believe that these four strains share a very recent common ancestor. The only explanation for having so many extra genes is a large-scale gene transfer from outside the cell. In this case, this Thailand isolate contains a gigantic plasmid with a size of ~800,000 bp (Okada et al., 2015).
Whole genome alignment
Pairwise genome-to-genome alignment can give us plenty of information about the structural variation among genomes. By “Venn Diagram” analysis, we found that TSY216 has 1,034 extra genes in its genome. Let’s visualize the structural differences between this strain and other typical 7th pandemic V. cholerae strains.
- Go to [PAIRWISE ANALYSIS]→[Whole genome alignment using NUCMER]
- Select “N16961” as “Reference genome”
- Select “TSY216” as “Query genome”, then click [Run].
- TSY216 has an extra DNA (it’s a megaplasmid) that has no similarity to the reference genome (=N16961).
It is clear that the source of 1,034 unique genes in TSY216 is the (recently introduced) megaplasmid.
Gene presence/absence analysis
Gene content information (as expressed by the presence or absence of genes) can be explored in different ways. Here, we will perform clustering analysis on the gene content among strains to see how they are related to each other.
- TSY216 has a large number of genes that are only present in that strain. Therefore, it was recovered as an outlier in this UPGMA dendrogram.
The figure below is the same dendrogram without detailed gene content information.
- Because TSY216 has so many extra genes, it became an outlier in the clustering analysis based on gene content.
However, if we use only differential gene (CDS) information (taking out singletons such as the 1,034 unique genes in TSY216), TSY216 is right back to the position near its true phylogenetical neighbors.
- Without the extra genes in contig #3 (mega-plasmid), TSY216 shows a very similar gene content to that of other 7th pandemic strains.
Tracking horizontal (lateral) gene transfer
In the bacterial world, genes often move as a package called a “gene cluster,” or more frequently called a “genomic island”. The best way of identifying such regions in a genome is to use a pairwise gene content matrix. In our example case, there are 16 strains so we will compare them one by one (as all 16 x 16 combinations) to identify orthologs. If you want to know more about how to identify orthologs using a pairwise approach, please visit here.
Here, we will compare the gene content between a reference genome and all others as a form of gene presence/absence. We will focus on the genomic region that is responsible for synthesizing O antigen, therefore conferring O serotype.
- Go to [PAIRWISE ANALYSIS]→[Pairwise Ortholog Matrix].
- Select “N16961” as the reference genome and choose [RBH(Protein)] algorithm. RBH stands for “Reciprocal Best Hit” which is explained in detail here.
- The “reference genome” is set to strain N16961. You may need to click [Apply] to apply the new reference genome.
- Scroll down the ortholog matrix(heatmap) to go to CDS GCF_000006745.1_00231. Move the cursor to the square representing GCF_000006745.1_00231. A tooltip will appear with the positional information (1:231:24514; it means contig #1, feature #231, on the position of 24514 bp).
- Move the cursor to the square representing the ortholog of GCF_000006745.1_00231 in strain MO10 (serotype O139). The tooltip of “23:345:89195” will appear, meaning that this gene is located on 89195 bp on contig #23, as 345th gene of the genome.
- Similarly, this square represents a gene with “1:261:277002” of strain N16961.
- This square represents a gene with “23:382:131391” of strain MO10.
In this ortholog matrix, genomes with O1 serotype shared most of the genes. However, O139 strains (MO10 & 4260B) showed a substantial number of missing genes/orthologs (indicated in red and orange). There are 30 genes in N16961 within this region (261-231; square d-b). In contrast, the corresponding region in MO10 contains even more, with 37 genes (382-345; square e-c). However, we cannot see the genomic organization of MO10 in this matrix as N16961 was the reference genome.
Now, we will change our reference to MO10 and investigate what is going on in the MO10 genome. In the above matrix, we noted that the starting gene of the region was located on contig #3 and 345th gene. Change the reference to MO10, click [Apply] and move to the point of ortholog matrix at contig #3 and GCF_000152425.1_00345.
Now, we will check out the situation in serotype O139, particularly the genomic region of MO10’s contig #23 positions between 89195 and 131391. Change the reference genome to MO10 and click [Apply]. Then, browse down the ortholog matrix to view our target region (check “Contig #” and “Location” to 23:89195 or 23:131391 depending on the direction of the contig #23).
In the screenshot below, you will easily find that a large set of genes were inserted in the region that is present in O1 serotypes. Since a majority of O1 genes are missing in O139 strains, we can hypothesize that this region in O139 has been replaced by other genes (by deletion and then insertion). This process is a type of puzzle solving.
- The reference genome is MO10 (O139 serotype).
- Many genes are only present in O139 strains indicating that this region replaced the O1 serotype-specific region. Therefore, a serotype conversion occurred (From O1 to O139 serotype) by horizontal gene transfer.
If you investigate the functions of these genes (see [Product] column), you will find many genes conferring lipopolysaccharide biosynthesis.
Phylogenetic history of strains (genomes) or genes
In EzBioCloud CG service, all genes/CDSs of the genomes included in the set are clustered into non-redundant orthologous groups (See here for how to generate pan-genome and core-genome). Here, we will try to generate a phylogenetic tree based on the genome and single gene.
Again, the below is the “ortholog matrix” for the genomic region coding for O antigen (reference=N16961 O1 El Tor).
- The “reference genome” is set to strain N16961.
- The CDS name GCF_000006745.1_00227 (Gene name=VC0236 / waaF,rfaF) is present in all genomes.
We will take this gene, namely VC0236 / waaF,rfaF, encoding for ADP-heptose-LPS heptosyltransferase 2 (required for O lipopolysaccharide antigen) to draw a phylogenetic tree. To retrieve all genes from 16 genomes, go to [PAN-GENOME]→[Browse Pan-genome Orthologous Groups (POGs)] where all the genes are clustered then organized into a pan-genome.
- Enter “waaF” to find your gene quickly.
- Click the POG name to access 16 genes belonging to this POG.
You are now in the page for browsing all genes that are included in the POG, annotated as waaF gene.
To get the DNA sequences of all waaF genes, click on the [Download DNA] button. If you have installed MEGA program, the downloaded FASTA format file can be opened by MEGA program. After multiple alignment and phylogenetic analysis, you will get the following phylogenetic tree of the waaF gene. (If you use MEGA version 7, follow the instruction below)
- Carry out “multiple alignment” by selecting [Alignment]→[Align by ClustalW] with default options.
- Move to “phylogenetic analysis” by selecting [Data]→[Phylogenetic analysis].
- Select MEGA main window and go to [Phylogeny]→[Construct Maximum Likelihood Tree] to build a ML tree.
The figure below illustrates the difference between the phylogeny of strains (whole genome) and a gene (waaF in this case). We will not go into details here, but there were no lateral gene transfer events of the waaF gene among the pandemic strains. It is interesting that LMA3984-4, a non-toxigenic O1 El Tor strain, shows high similarity to pandemic strains in the waaF gene, but not the whole genome. At least for the waaF gene, pandemic strains and LMA3984-4 share a very recent ancestor which may be the source of the gene cluster that codes for O-antigen biosynthesis, which is currently found in pandemic strains.
What we explained here is only a fraction of what is offered by EzBioCloud’s comparative genomics. More extended manuals and success stories will be posted. Many of the analytic methods explained here were published in Chun et al. (2009), so read this paper if you want to know more about the methodology for comparative genomics and Vibrio cholerae.
- Okada, K. et al. Characterization of 3 Megabase-Sized Circular Replicons from Vibrio cholerae. Emerg Infect Dis 21, 1262-1263 (2015).
- Chun, J. et al. Comparative genomics reveals mechanism for short-term and long-term clonal transitions in pandemic Vibrio cholerae. Proc Natl Acad Sci U S A 106, 15442-15447 (2009).
This tutorial was prepared by Jon Jongsik Chun (Seoul National Univ/CJ Bioscience, Inc.) and Suyeon Hong (Yale Univ).
Last updated on July 2nd, 2017.