Statistical comparison of genomic properties

This analysis compares the following genomic properties to identify possible correlations between them:

  • Genome size in Mbp (million base pairs)
  • DNA GC content in percentage
  • Number of CDSs
  • Mean length of CDSs
  • Mean length of intergenic regions
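As a rough illustration of how these five properties could be derived for a single genome, the sketch below works on a toy sequence with hypothetical CDS coordinates. The function name genome_properties, the 0-based end-exclusive coordinates, and the definition of intergenic regions as the gaps between consecutive CDSs on one linear sequence are all assumptions made for this example; they are not taken from the tutorial's own pipeline.

```python
def genome_properties(sequence, cds_intervals):
    """Return the five compared properties for one genome.

    sequence: DNA string (assumed to be a single linear replicon).
    cds_intervals: non-empty list of (start, end) pairs,
                   0-based and end-exclusive (an assumption of this sketch).
    """
    seq = sequence.upper()
    size_bp = len(seq)
    gc_percent = 100.0 * (seq.count("G") + seq.count("C")) / size_bp

    cds = sorted(cds_intervals)
    cds_lengths = [end - start for start, end in cds]

    # Intergenic regions: positive gaps between consecutive CDSs.
    # Overlapping or directly adjacent CDSs contribute no gap (assumption).
    gaps = []
    prev_end = cds[0][1]
    for start, end in cds[1:]:
        if start > prev_end:
            gaps.append(start - prev_end)
        prev_end = max(prev_end, end)

    return {
        "size_mbp": size_bp / 1e6,
        "gc_percent": gc_percent,
        "n_cds": len(cds),
        "mean_cds_len": sum(cds_lengths) / len(cds_lengths),
        "mean_intergenic_len": sum(gaps) / len(gaps) if gaps else 0.0,
    }


# Toy input only: a 26 bp sequence with two hypothetical CDSs.
toy = genome_properties("ATGCGCGCTAAATATATGGGCCCTAG", [(0, 9), (15, 26)])
print(toy)
```

Real genomes are often circular and contain overlapping genes on both strands, so a production version would need a more careful definition of intergenic regions.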

Whether two properties are correlated depends on the set of genomes being compared. This can be checked with a simple linear regression analysis. If you run this analysis on a set of genomes from a single species, you may find interesting evolutionary traits in that species.

This analysis is carried out using the lm function in R.
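The tutorial itself relies on R's lm. The sketch below is not that code, but a minimal pure-Python illustration of what a simple linear regression computes: the least-squares intercept and slope, the correlation coefficient r, and the coefficient of determination R². The function name simple_linear_regression and the input values are made up for illustration and are not the V. vulnificus data.

```python
def simple_linear_regression(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Sums of squares and cross-products around the means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5  # Pearson correlation coefficient

    # With a single predictor, R squared equals r squared.
    return {"intercept": intercept, "slope": slope, "r": r, "r_squared": r * r}


# Made-up toy values, NOT the data behind the tutorial's figure.
genome_size_mbp = [4.9, 5.0, 5.1, 5.2, 5.3]
n_cds = [4400, 4520, 4580, 4700, 4790]

fit = simple_linear_regression(genome_size_mbp, n_cds)
print(fit)
```

For a regression with one predictor, the intercept, slope, and R² computed this way should correspond to what lm(y ~ x) reports in R.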

The following chart shows the pairwise comparisons of the 5 genomic properties among 31 Vibrio vulnificus strains.

(A) This plot shows the regression of the number of CDSs on genome size among 31 V. vulnificus strains. Within this species, the number of CDSs is positively correlated with genome size (correlation coefficient r=0.89). The coefficient of determination (R²=0.78) means that about 78% of the variance in the number of CDSs is explained by genome size through the fitted regression line (y=-816+1082x, where x is genome size in Mbp and y is the number of CDSs).
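To illustrate how the regression line reported for panel (A) can be read, the short sketch below plugs a genome size into y=-816+1082x. It assumes that x is genome size in Mbp and y is the number of CDSs. The function name predict_cds_count and the 5.0 Mbp input are chosen only for illustration.

```python
# Coefficients as reported for panel (A): y = -816 + 1082x
INTERCEPT = -816
SLOPE = 1082


def predict_cds_count(genome_size_mbp):
    """Predicted number of CDSs for a genome size given in Mbp (assumed unit)."""
    return INTERCEPT + SLOPE * genome_size_mbp


# 5.0 Mbp is an arbitrary example input, not a measured genome.
predicted = predict_cds_count(5.0)
print(predicted)  # 4594.0
```

Under these assumptions, the slope says that each additional Mbp of genome corresponds to roughly 1082 more CDSs. The negative intercept has no biological meaning on its own, since the line is only supported within the range of genome sizes that was actually observed.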

(B) This plot gives the summary statistics (mean, median, standard deviation) of a single genomic property (in this case, the mean length of intergenic regions).
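The three statistics named for panel (B) can be computed as in the sketch below, which uses Python's standard statistics module. The input values are made up for illustration, and the use of the sample standard deviation (n - 1 denominator, which is also what R's sd function uses) is an assumption about how the plot was produced.

```python
import statistics


def summarize(values):
    """Mean, median, and sample standard deviation of one genomic property."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),  # sample SD, n - 1 denominator
    }


# Made-up toy values (bp), NOT the data behind the tutorial's figure.
mean_intergenic_len_bp = [150.0, 152.0, 148.0, 155.0, 145.0]

summary = summarize(mean_intergenic_len_bp)
print(summary)
```

A mean that differs clearly from the median suggests a skewed distribution or outlier strains, which is worth checking before interpreting any regression that involves that property.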

Last updated on April 28th, 2016 (EK)