Species Core Genome (SCG)

The Species Core Genome (SCG) is an artificially generated genome sequence that contains the set of core genes of a species.

How to construct SCG

An SCG is constructed using the following procedure:

  1. Select all complete genomes belonging to a target species. Taxonomic identification of each genome is confirmed using the genome-based identification algorithm (Chun et al. 2018).
  2. Generate a phylogenomic tree using the UBCG pipeline with the maximum likelihood method, then manually select representative genomes from the resulting tree. This selection avoids phylogenetic bias among the representative genomes.
  3. Calculate the set of core gene clusters (orthologous groups) using the Roary pipeline. The selected core genes are also candidates for whole-genome Multilocus Sequence Typing (wgMLST).
  4. To construct the SCG, take one representative gene from each core gene cluster and append it to the SCG. When choosing the representative, give priority to the gene from the genome with greater historical significance. For instance, genes of the well-known E. coli K12 are considered first because it is the most extensively studied strain. In essence, the SCG is a concatenation of core genes taken from different strains. Intergenic regions are not included in the SCG.
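
Step 4 can be illustrated with a minimal Python sketch. It is not the actual pipeline code: the input layout (a dictionary of clusters), the function name build_scg, the strain names other than K12, and the ordering of genes by cluster ID are all assumptions made for illustration.

```python
from typing import Dict, List


def build_scg(core_clusters: Dict[str, Dict[str, str]], priority: List[str]) -> str:
    """Concatenate one representative gene per core gene cluster.

    core_clusters maps a cluster ID to {genome ID: gene nucleotide sequence}.
    priority lists genome IDs from most to least preferred (hypothetical input).
    """
    parts = []
    # Assumption: genes are appended in sorted cluster-ID order.
    for cluster_id in sorted(core_clusters):
        members = core_clusters[cluster_id]
        # Take the gene from the highest-priority genome present in this cluster.
        chosen = next((g for g in priority if g in members), None)
        if chosen is None:
            # No prioritized genome has a member: fall back deterministically.
            chosen = min(members)
        parts.append(members[chosen])
    # Genes are joined directly, so no intergenic regions are included.
    return "".join(parts)


clusters = {
    "cluster_001": {"K12": "ATGAAA", "strainB": "ATGAAG"},
    "cluster_002": {"strainB": "ATGCCC", "strainC": "ATGCCA"},
}
scg = build_scg(clusters, priority=["K12", "strainB", "strainC"])
print(scg)
```

In this toy input the first cluster is represented by the K12 gene, while the second falls back to the next genome in the priority list because K12 has no member there.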


The SCG contains the core part of the genome of a given species, and can be used for the following purposes:

1. SNP-based phylogenomic treeing
Single nucleotide polymorphisms (SNPs) can be called for any genome of the target species against the SCG in a standardized way; NUCmer is highly recommended. The SNP calls from multiple genomes are then combined into a multiple sequence alignment, which can be used for phylogenetic analyses.
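
As a rough sketch of how per-genome SNP calls might be combined into an alignment, the Python example below builds an alignment of the variable sites only. The data layout (0-based SCG position mapped to the called base) and the function name snp_alignment are assumptions. It also makes a simplification: any genome without a call at a site is assumed to carry the SCG base there, which ignores regions that failed to align.

```python
from typing import Dict, List, Tuple


def snp_alignment(
    reference: str, snp_calls: Dict[str, Dict[int, str]]
) -> Tuple[List[int], Dict[str, str]]:
    """Combine per-genome SNP calls into an alignment of variable sites.

    reference is the SCG sequence; snp_calls maps a genome ID to
    {0-based SCG position: called base} (hypothetical layout).
    """
    # Every position that differs from the SCG in at least one genome.
    sites = sorted({pos for calls in snp_calls.values() for pos in calls})
    alignment = {}
    for genome, calls in snp_calls.items():
        # Use the called base if present, otherwise the SCG base.
        alignment[genome] = "".join(calls.get(pos, reference[pos]) for pos in sites)
    return sites, alignment


reference = "ATGAAAATGCCC"
calls = {
    "genome1": {2: "A", 7: "C"},
    "genome2": {7: "C"},
    "genome3": {},
}
sites, alignment = snp_alignment(reference, calls)
for genome, row in alignment.items():
    print(genome, row)
```

Each row has one column per variable site, so the rows can be written out as FASTA and passed to a tree-building program.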

2. SNP-based rapid searching against the genome database
SNP calls against the same SCG can be used to search a query genome against a database of multiple genomes. Because the SNP calls for the database genomes can be precalculated before searching, this process is efficient when a newly sequenced genome is searched against a large database (e.g. the E. coli/Shigella group has >10,000 genomes).
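
The search idea can be sketched in Python as follows. This is an illustration only: the distance measure (the number of SCG positions at which two SNP profiles disagree), the profile layout, and the function names snp_distance and search are assumptions rather than the method used by the actual service.

```python
from typing import Dict, List, Tuple

Profile = Dict[int, str]  # 0-based SCG position -> called base


def snp_distance(a: Profile, b: Profile) -> int:
    """Count SCG positions at which two SNP profiles disagree."""
    # A position missing from a profile means "same as the SCG" there.
    return sum(1 for pos in a.keys() | b.keys() if a.get(pos) != b.get(pos))


def search(query: Profile, database: Dict[str, Profile], top_n: int = 5) -> List[Tuple[str, int]]:
    """Return the top_n database genomes closest to the query profile."""
    ranked = sorted((snp_distance(query, prof), name) for name, prof in database.items())
    return [(name, dist) for dist, name in ranked[:top_n]]


# Database profiles are computed once, ahead of any search.
database = {
    "genome1": {2: "A", 7: "C"},
    "genome2": {7: "C"},
    "genome3": {},
}
# Only the query genome needs SNP calling at search time.
query = {7: "C", 9: "T"}
hits = search(query, database, top_n=2)
print(hits)
```

Because every profile is expressed against the same SCG coordinates, comparing two genomes reduces to comparing two small dictionaries, with no need to align the genomes to each other.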

Last updated on Feb. 3