UBCG stands for up-to-date bacterial core gene. It is a method and software tool for inferring phylogenetic relationships using a bacterial core gene set that is defined from an up-to-date bacterial genome database.
This document is for version 3. If you have an older version, please download and install the latest version.
How to cite the UBCG pipeline
If you use this tool, please cite the following:
Na, S. I., Kim, Y. O., Yoon, S. H., Ha, S. M., Baek, I. & Chun, J. (2018). UBCG: Up-to-date bacterial core gene set and pipeline for phylogenomic tree reconstruction. J Microbiol 56, (in press). [View only version]
Gene set used in the UBCG pipeline
The most widely employed method for genome-based phylogenetic tree reconstruction uses a core gene set. The core gene set can be defined as:
Genes that are present in the majority of species, if not all
Genes that are present in a single copy (likely orthologous but not paralogous)
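The two criteria above can be sketched in a few lines of Python. This is only an illustration: the input layout (a mapping from genome name to the gene families found in it), the function name and the 0.95 threshold are assumptions made for the example, not the procedure or cutoff used to build the UBCG set.

```python
from collections import Counter

def select_core_genes(gene_hits, min_fraction=0.95):
    """Return gene families that are single-copy in at least min_fraction of genomes.

    gene_hits maps a genome name to the list of gene families found in it;
    a family listed twice means two copies in that genome.
    The layout and the default threshold are illustrative assumptions.
    """
    n_genomes = len(gene_hits)
    single_copy_counts = Counter()
    for families in gene_hits.values():
        for family, copies in Counter(families).items():
            if copies == 1:  # exactly one copy: likely orthologous, not paralogous
                single_copy_counts[family] += 1
    return sorted(
        family
        for family, n in single_copy_counts.items()
        if n / n_genomes >= min_fraction
    )

# Toy input with made-up content, for illustration only.
toy = {
    "genome_A": ["rpoB", "gyrB", "recA"],
    "genome_B": ["rpoB", "gyrB", "gyrB"],  # gyrB is duplicated in this genome
    "genome_C": ["rpoB", "recA"],
}
print(select_core_genes(toy, min_fraction=0.6))  # ['recA', 'rpoB']
```

Raising min_fraction to 1.0 keeps only the genes that are single-copy in every genome, which mirrors how the set shrinks as the taxonomic scope widens.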
The number of core genes varies depending on the scope of a target taxon. If you generate a phylogenetic tree for a species, the core gene set may consist of up to thousands of genes. However, to cover any taxa in the domain Bacteria, the core gene set should be restricted to the highly conserved ones (Bacterial Core Gene [BCG]).
Because the complete genome sequences in public databases do not yet cover all taxa, and their number keeps growing, the number of genes in the BCG set varies over time.
Here, we compiled the latest bacterial core gene set, named UBCG, using the largest dataset to date (1,429 complete genome sequences, a single genome per species, covering 28 phyla). The current UBCG set consists of 92 genes whose details are given here.
Concept of the UBCG pipeline
We designed the pipeline so that users can handle hundreds of genomes, if not thousands. The concept behind the design is outlined here to help you understand the pipeline and get the most out of it.
All UBCG sequences extracted from a genome sequence are stored in a single file (*.bcg). This file also contains a label with full information about the strain (e.g. Escherichia coli K12 MG1655) and other details (e.g. database accession). Once a bcg file is generated, it can be reused for different analyses. Because the label is stored in the bcg file, users can change the labels shown in the phylogenetic trees.
A run is carried out using a set of bcg files of the user’s choice. The selected bcg files are saved in a single directory; the UBCG pipeline then aligns each of the core genes, concatenates the alignments, filters the aligned positions, and calculates phylogenetic trees and gene support indices (GSIs).
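The concatenation and position-filtering steps can be pictured with the sketch below. It is not the pipeline's code: the function names, the gap padding for genomes that lack a gene, and the 0.5 gap-fraction threshold are assumptions chosen for illustration.

```python
def concatenate_alignments(gene_alignments):
    """Join per-gene alignments into one alignment keyed by genome label.

    gene_alignments is a list of dicts {label: aligned_sequence}, one per gene.
    A genome missing from a gene is padded with gaps of that gene's length.
    """
    labels = sorted({label for aln in gene_alignments for label in aln})
    parts = {label: [] for label in labels}
    for aln in gene_alignments:
        length = len(next(iter(aln.values())))  # all rows of one gene share a length
        for label in labels:
            parts[label].append(aln.get(label, "-" * length))
    return {label: "".join(chunks) for label, chunks in parts.items()}

def filter_gappy_columns(alignment, max_gap_fraction=0.5):
    """Drop alignment columns whose gap fraction exceeds max_gap_fraction."""
    labels = list(alignment)
    length = len(alignment[labels[0]])
    keep = [
        i for i in range(length)
        if sum(alignment[label][i] == "-" for label in labels) / len(labels)
        <= max_gap_fraction
    ]
    return {label: "".join(alignment[label][i] for i in keep) for label in labels}

# Toy alignments with made-up sequences, for illustration only.
gene1 = {"A": "ATG", "B": "A-G", "C": "ATG"}
gene2 = {"A": "CC-", "B": "CC-"}  # genome C lacks this gene
supermatrix = concatenate_alignments([gene1, gene2])
print(filter_gappy_columns(supermatrix))
```

In the toy data the last column is all gaps after concatenation, so it is the only column removed.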
To run the pipeline on another set of bcg files, store the desired bcg files in the bcg directory and re-run the pipeline. In other words, the set of bcg files analyzed together is controlled by the content of the directory holding the bcg files.
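Since the directory content defines the run, one convenient habit is to keep all bcg files in a pool and copy only the wanted ones into a fresh directory for each analysis. The helper below is a hypothetical sketch of that habit; the function name and the directory names in the usage note are assumptions, not part of UBCG.

```python
import shutil
from pathlib import Path

def stage_bcg_files(selected, run_dir):
    """Copy the chosen .bcg files into run_dir and return the staged file names.

    run_dir is created if needed; its content then defines the genome set
    that a pipeline run pointed at this directory would analyze.
    """
    run_dir = Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)
    staged = []
    for src in map(Path, selected):
        if src.suffix != ".bcg":
            raise ValueError(f"not a bcg file: {src}")
        shutil.copy2(src, run_dir / src.name)
        staged.append(src.name)
    return sorted(staged)
```

For example, stage_bcg_files(Path("bcg_pool").glob("*.bcg"), "bcg") would copy every file from a hypothetical bcg_pool directory into the bcg directory.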
UBCG has been tested on Linux and Mac OS X 10 or higher. MS Windows is not supported because of the external programs the pipeline depends on. Windows users should run it in a virtual Linux machine, for example with VirtualBox.
Unzip the UBCG.zip file in the desired directory.
File formats used in UBCG tool
Files with the *.bcg extension are in JSON format and contain all extracted UBCG gene sequences together with metadata (data about data). A bcg file is plain text and can be read with any text editor, so you can extract sequence information and edit the metadata if necessary. The bcg format is designed to hold all necessary information about the genome and strain.
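Because a bcg file is ordinary JSON, it can be inspected with any JSON library. The sketch below deliberately makes no assumption about the field names inside a bcg file; it only reports the top-level structure of whatever JSON text it is given. The function name is hypothetical.

```python
import json

def summarize_json(text):
    """Describe the top-level structure of a JSON document without assuming its schema."""
    data = json.loads(text)
    if isinstance(data, dict):
        # Map each top-level key to the Python type name of its value.
        return {key: type(value).__name__ for key, value in data.items()}
    return {"<root>": type(data).__name__}
```

A call such as summarize_json(open("bcg/my_genome.bcg").read()), where the file name is a placeholder, is a quick way to see which fields a bcg file holds before editing its metadata.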
FASTA is a standard file format for holding genome sequences. In the UBCG tool, all fasta files containing genome sequences must be converted to bcg files before multiple alignments are generated and phylogenetic trees are inferred. The results of the multiple alignments are also written as fasta format files.
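For readers who want to post-process the alignment files, the FASTA layout is simple enough to read with a few lines of Python. This is a minimal sketch with a hypothetical function name; it keeps the whole header line as the record name and does not validate sequence characters.

```python
def parse_fasta(text):
    """Parse FASTA text into {header: sequence}, joining wrapped sequence lines."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(chunks) for name, chunks in records.items()}

example = ">contig1 example\nATG\nCCC\n>contig2\nGG\n"
print(parse_fasta(example))  # {'contig1 example': 'ATGCCC', 'contig2': 'GG'}
```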
Newick is a standard format for phylogenetic trees.
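As a quick illustration of the format, the sketch below pulls leaf labels and internal node labels out of a Newick string with regular expressions. Internal node labels are where tree-building programs commonly store support values. The function names are hypothetical, and the approach assumes unquoted labels that contain no parentheses, commas, colons or semicolons; a full Newick parser is needed for anything more complex.

```python
import re

def leaf_labels(newick):
    """Return leaf labels: names that directly follow '(' or ','."""
    return [m.strip() for m in re.findall(r"[(,]\s*([^(),:;]+)", newick)]

def internal_labels(newick):
    """Return internal node labels: names that directly follow ')'."""
    return [m.strip() for m in re.findall(r"\)\s*([^(),:;]+)", newick)]

tree = "((A:0.1,B:0.2)92:0.05,C:0.3);"
print(leaf_labels(tree))      # ['A', 'B', 'C']
print(internal_labels(tree))  # ['92']
```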
The pipeline also writes a JSON format file that contains the Newick-format trees and metadata of the individual core gene trees and the UBCG tree.
A log file is a text format file that contains detailed information about the pipeline run.
A typical structure of directories
The program’s root directory should contain the “UBCG.jar” file and the “programPath” file [Learn more] that contains the location information of the external software tools.
“fasta” directory contains the FASTA format files holding example genome/contig sequences.
“bcg” directory contains JSON format files (=*.bcg) holding UBCG gene sequences with metadata.
“output” directory contains all output files generated by the UBCG tool. Within the “output” directory, the results of each run are stored in a separate directory (defined by “-prefix”).
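Before a run it can help to confirm that the layout described above is in place. The check below is a hypothetical helper, not part of UBCG; it only looks for the file and directory names listed in this section.

```python
from pathlib import Path

EXPECTED_FILES = ("UBCG.jar", "programPath")
EXPECTED_DIRS = ("fasta", "bcg", "output")

def missing_layout_items(root):
    """Return the expected files and directories that are absent under root."""
    root = Path(root)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    missing += [name + "/" for name in EXPECTED_DIRS if not (root / name).is_dir()]
    return missing
```

Calling missing_layout_items(".") from the program's root directory returns an empty list when everything is present.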
Installing external programs
The following programs should be installed in advance. The locations of the programs should be written in the “programPath” file.
You may also install and use other tools for phylogenetic inferences. Since we provide multiple-alignment files, any phylogenetic inference program can be used to generate phylogenetic trees from UBCGs.
Running UBCG pipeline
Step 1: Converting genome assemblies or contigs (fasta) to bcg files
Command: java -jar UBCG.jar extract
This command converts a fasta file to a bcg file using the prodigal and hmmsearch tools.
You are required to designate the following parameters:
-i : path of an input FASTA file containing genome assemblies.
-bcg_dir : directory for all bcg files. The bcg file will have the same name as the fasta file. (e.g. -bcg_dir bcg)
-label : full label of the strain/genome. Enclose it in quotes so that the shell passes it as a single argument (e.g. -label “Escherichia coli O157 876”).
The following are optional, but useful, metadata:
-taxon : name of species (e.g. -taxon “Escherichia coli”)
-strain : name of the strain (e.g. -strain “JC 126”)
-type : add this if a strain is the type strain of species or subspecies (e.g. -type)
-acc : accession of a genome sequence. Usually, NCBI’s assembly accession is used for public domain data.
-uid : a unique integer id. If you do not specify one, it will be generated automatically. Leave this option out if you are not sure about it.
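When many genomes need converting, the extract command can be assembled programmatically. The sketch below only builds the argument list from the options described above; the function name and the example file name are assumptions, and nothing is executed.

```python
def build_extract_command(fasta, bcg_dir, label, taxon=None, strain=None,
                          is_type=False, acc=None, uid=None):
    """Build the argument list for 'java -jar UBCG.jar extract'.

    Only the options documented in this section are used. Passing the label
    as one list element avoids any shell quoting problems.
    """
    cmd = ["java", "-jar", "UBCG.jar", "extract",
           "-i", str(fasta), "-bcg_dir", str(bcg_dir), "-label", label]
    for flag, value in (("-taxon", taxon), ("-strain", strain), ("-acc", acc)):
        if value is not None:
            cmd += [flag, value]
    if uid is not None:
        cmd += ["-uid", str(int(uid))]  # -uid must be an integer
    if is_type:
        cmd.append("-type")  # flag without a value
    return cmd

# Hypothetical input file name, for illustration only.
cmd = build_extract_command("fasta/example.fasta", "bcg",
                            "Escherichia coli O157 876",
                            taxon="Escherichia coli", is_type=True)
print(cmd)
```

The resulting list can be handed to subprocess.run(cmd, check=True) from the program's root directory once UBCG and its external programs are installed.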
The content of bcg files (for example, gene sequences) can be viewed (as CSV format that is readable by Microsoft Excel or Google spreadsheet) by using the following command:
Outputs will be saved in the “output/my_example” directory.
Take the “*.UBCG_gsi(92).codon.50.label.nwk” file and open it with MEGA or another tree-viewing program. (The screenshot below is from MEGA.)
An example UBCG tree. (T) denotes the type strain.
This UBCG tree shows that
Two Streptococcus constellatus subspecies are closely related; all 92 UBCGs supported this.
Streptococcus pneumoniae TIGR4 is closely related to the type strain of Streptococcus pneumoniae; 72 out of 92 UBCGs supported this.
Two tentatively new species, named CP012646_s and CP014326_s, formed a monophyletic clade with S. pneumoniae, Streptococcus pseudopneumoniae, and Streptococcus mitis.
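The support values quoted above can be read as counts of individual gene trees that agree with a branch. The sketch below illustrates that reading under a simplifying assumption: each gene tree is represented as the set of clades (groups of leaf labels) it contains, and a branch is supported by a gene tree when the same clade appears in it. The function name and data layout are hypothetical and are not the pipeline's own implementation.

```python
def gene_support_index(clade, gene_tree_clades):
    """Count the gene trees that contain `clade` as one of their clades.

    clade is an iterable of leaf labels; gene_tree_clades is a list with one
    set of frozensets per gene tree, each frozenset being a clade of that tree.
    """
    target = frozenset(clade)
    return sum(1 for clades in gene_tree_clades if target in clades)

# Three toy gene trees with made-up clades, for illustration only.
toy_trees = [
    {frozenset({"A", "B"}), frozenset({"A", "B", "C"})},
    {frozenset({"A", "B"})},
    {frozenset({"B", "C"})},
]
print(gene_support_index({"A", "B"}, toy_trees))  # 2
```

With 92 gene trees, a value of 92 would mean every gene tree recovers the clade, and 72 would mean 20 gene trees disagree or lack it.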
The dendrogram below shows OrthoANI-based clustering of the same genomes. Please note that CP012646_s and CP014326_s each represent a novel species when a 95~96% average nucleotide identity (ANI) cutoff is applied [Learn more].
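The effect of an ANI cutoff can be illustrated with a small grouping routine. Note that this is a simplification: it performs single-linkage grouping at a fixed threshold, not the hierarchical clustering behind the dendrogram, and the genome names and ANI values are made up for the example.

```python
def cluster_by_ani(genomes, ani, cutoff=95.0):
    """Group genomes so that any pair with ANI >= cutoff ends up in one group.

    ani maps (genome_a, genome_b) pairs to percent identity values.
    Uses a small union-find structure (single-linkage grouping).
    """
    parent = {g: g for g in genomes}

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path compression
            g = parent[g]
        return g

    for (a, b), value in ani.items():
        if value >= cutoff:
            parent[find(a)] = find(b)

    groups = {}
    for g in genomes:
        groups.setdefault(find(g), []).append(g)
    return sorted(sorted(members) for members in groups.values())

# Made-up values, for illustration only.
toy_ani = {("g1", "g2"): 97.2, ("g1", "g3"): 88.0, ("g2", "g3"): 87.5}
print(cluster_by_ani(["g1", "g2", "g3"], toy_ani))  # [['g1', 'g2'], ['g3']]
```

A genome that falls below the cutoff against every other genome, like g3 here, forms its own group, which is the situation described for the two tentatively new species.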