A pan-genome and a core-genome can be obtained from multiple genomes in a data set. Both core- and pan-genomes are collections of orthologs or orthologous groups, and in EzCgDb we call these “Pan-genome orthologous groups (POGs)”.
Let’s say that you have 100 genomes in a data set. The strictest core-genome is obtained by setting the cutoff at 100% (i.e., POGs that are present in all genomes). With this setting, even POGs that are present in 99 genomes will not be considered. However, in many cases, we would like to use a less stringent cutoff for the following reasons:
- Not all genomes are completely sequenced, so some CDSs may not be included in final contigs (assemblies).
- The gene prediction process (software that finds the locations of CDSs) may miss the correct CDS.
- For CDS products that are not really important (i.e., not housekeeping), the function or role is often carried out by other CDSs that are very different in sequence. In this case, we will miss the CDSs with different sequences because we use a sequence-based approach to detect orthologs (see here for more details).
A popular cutoff is 95%, which in this example picks up POGs that are present in 95 to 100 genomes. Core-genomes can be obtained at different cutoffs, and the pan-genome is effectively the core-genome obtained with a 0% cutoff, so it includes POGs that are found even in a single genome.
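The effect of the cutoff can be illustrated with a minimal Python sketch. This is not EzCgDb's implementation: the function name `core_genome`, the POG identifiers, and the toy presence data are all hypothetical, and the sketch assumes that each POG is already mapped to the set of genomes in which it was detected.

```python
def core_genome(pog_presence, n_genomes, cutoff_percent):
    """Return the POG ids present in at least cutoff_percent % of the genomes.

    pog_presence maps a (hypothetical) POG id to the set of genome indices
    in which that POG was detected.
    """
    # Integer ceiling of cutoff_percent * n_genomes / 100, avoiding float rounding.
    min_genomes = -(-cutoff_percent * n_genomes // 100)
    # A POG must be found in at least one genome, so a 0% cutoff behaves like 1 genome.
    min_genomes = max(1, min_genomes)
    return {pog for pog, genomes in pog_presence.items()
            if len(genomes) >= min_genomes}


# Toy data set of 100 genomes (indices 0..99); all values are made up.
N_GENOMES = 100
pog_presence = {
    "POG_A": set(range(100)),  # present in all 100 genomes
    "POG_B": set(range(99)),   # present in 99 genomes
    "POG_C": set(range(95)),   # present in 95 genomes
    "POG_D": set(range(94)),   # present in 94 genomes
    "POG_E": {0},              # present in a single genome
}

strict_core = core_genome(pog_presence, N_GENOMES, 100)
relaxed_core = core_genome(pog_presence, N_GENOMES, 95)
pan_genome = core_genome(pog_presence, N_GENOMES, 0)

print(sorted(strict_core))
print(sorted(relaxed_core))
print(sorted(pan_genome))
```

With this toy data, the 100% cutoff keeps only the POG found in every genome, the 95% cutoff also keeps the POGs found in 99 and 95 genomes, and the 0% cutoff returns every POG, which corresponds to the pan-genome.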
The following chart in EzCgDb is based on 100 genomes of Acinetobacter baumannii. It highlights that the pan-genome contains more POGs without known homologs in the database (the X category in the chart below), which suggests that many accessory genes are new to us. Today, however, it is well known that accessory genes present in only 1-2 genomes are mostly from mobile genetic elements such as bacteriophages.
Updated on May 17th 2016 (EK)