Biotools for Comparative Microbial Genomics Wiki

Basic genome statistics

The basic statistics are calculated from the raw DNA sequence, not from genes. The results include:

  • Total length (base pairs)
  • Percentage AT
  • Standard deviation AT (in the case of multiple replicons/contigs)
  • Number of replicons/contigs
  • Percentage of unknown bases (not A, T, C or G)
  • Fraction of the genome made up by the largest contig/replicon, as a percentage of total genome length. This measure is mainly useful for judging whether most of the genome is in one piece or whether it is heavily fragmented.

The program is called as follows:

genomeStatistics <file>.fna
Filename TotalBases: Per.AT: StDevAT: ContigCount: Per.Unknowns: Per.LargestSeq
<file>.fna 2132142 61.3707 0.0000 1 0.0000 100.0000
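
As a rough illustration of how the measures listed above can be derived from a FASTA file, here is a minimal awk sketch. It is not the source of genomeStatistics: the function name fasta_basic_stats is hypothetical, Percentage AT is assumed to be taken over all bases (unknowns included), and the standard deviation column is left out because this page does not say how it is defined.

```shell
# Hypothetical helper, not the genomeStatistics program itself.
# Prints: total bases, percent AT, contig count, percent unknowns, percent in largest contig.
fasta_basic_stats() {
  awk '
    # Close the current contig and fold its length into the totals.
    function finish_seq() {
      if (len > 0) {
        n++
        total += len
        if (len > largest) largest = len
      }
      len = 0
    }
    BEGIN { largest = 0 }
    /^>/ { finish_seq(); next }              # header line starts a new contig
    {
      seq = toupper($0)
      gsub(/[ \t\r]/, "", seq)               # drop stray whitespace
      len += length(seq)
      at  += gsub(/[AT]/, "", seq)           # gsub returns the number of matches removed
      cg  += gsub(/[CG]/, "", seq)
      unk += length(seq)                     # what is left is not A, T, C or G
    }
    END {
      finish_seq()
      if (total == 0) exit 1                 # no sequence data
      # Assumption: percent AT is relative to all bases, unknowns included.
      printf "%d\t%.4f\t%d\t%.4f\t%.4f\n", total, 100 * at / total, n, 100 * unk / total, 100 * largest / total
    }
  ' "$1"
}
```

Calling fasta_basic_stats <file>.fna prints one tab-separated line per file, in the same column order as the output above minus the file name and StDevAT columns.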


Unknown bases analysis

Some DNA sequences contain bases other than A, T, C or G. This can be a product of assembly programs, where the distance between two sequences is known but the sequence between them is not. The analysis of these stretches produces the following measures (shown with example values):

  • Total number of bases 2209947
  • Total number of unknown stretches 99
  • Total number of unknowns 79605
  • Percentage of unknowns 3.60212258484027
  • Average length of unknown stretch 804.090909090909
  • Max/min length of unknown stretch 1780 141

The program is called as follows:

countUnknowns.pl Megamonas_hypermegale_ART12_1.fna 
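
The measures above can be reproduced in spirit with a short awk sketch. This is not the source of countUnknowns.pl: the function name count_unknown_stretches is hypothetical, and two assumptions are made that the page does not confirm, namely that a stretch may continue across line breaks within one sequence but never across two sequences, and that lower-case a, c, g, t count as known bases.

```shell
# Hypothetical helper, not the source of countUnknowns.pl.
count_unknown_stretches() {
  awk '
    # Scan one complete sequence for runs of characters other than A, C, G, T.
    function scan(s,    run) {
      total += length(s)
      while (match(s, /[^ACGT]+/)) {
        run = RLENGTH
        stretches++
        unknowns += run
        if (run > max) max = run
        if (min == 0 || run < min) min = run
        s = substr(s, RSTART + RLENGTH)        # continue after this run
      }
    }
    BEGIN { max = 0; min = 0 }
    /^>/ { scan(seq); seq = ""; next }         # header: finish the previous sequence
    { line = toupper($0); gsub(/[ \t\r]/, "", line); seq = seq line }
    END {
      scan(seq)
      if (total == 0) exit 1                   # no sequence data
      printf "Total number of bases\t%d\n", total
      printf "Total number of unknown stretches\t%d\n", stretches
      printf "Total number of unknowns\t%d\n", unknowns
      printf "Percentage of unknowns\t%.4f\n", 100 * unknowns / total
      printf "Average length of unknown stretch\t%.4f\n", (stretches ? unknowns / stretches : 0)
      printf "Max/min length of unknown stretch\t%d\t%d\n", max, min
    }
  ' "$1"
}
```

The sketch joins each sequence into one string before scanning so that a gap wrapped over several FASTA lines is counted once rather than once per line.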

Amino acid and codon usage

Figure: amino acid usage, codon usage, basic statistics and bias in third codon position for Vibrio parahaemolyticus RIMD 2210633 (prodigal).

There are several ways to analyze third-position base use, amino acid usage and codon usage. The first is a visual presentation, which is suited to showing the patterns of a few genomes but is not useful for comparing many genomes. The analysis takes as input the open reading frames predicted by a gene finder (DNA sequences in FASTA format):

>NZ_ADFU01000001__CDS_1275-526
ATGAAAAAATCCACTTTGCTTGCTTTCACAGCGGCAGTATTATTCGGCAGTGGCGT
CACGTTAATGCGGCATCTGCTACATATGATGATCCATTGCTTTTACCAAATCCTGC
GCGCCTACAACAGGTTCTGTTGTATTGGTTCCTGTGGCTAGCCCTCAGGCGGTGCA
............

The output of the analysis is a PDF file along with a raw data file, whose format is shown below:

Veillonella_parvula_ATCC_17745_prodigal.orf.fna
TotalBases: 1900137 PerAT: 60.38 StDevAT: 0.04
codon	AAA	4.39974	27867
codon	CAA	2.79548	17706
.........
aa	C	0.9828
aa	P	3.6291
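
To make the codon lines easier to read: in the example listing, 27867 / (1900137 / 3) × 100 ≈ 4.39974, which is consistent with the middle column being the codon count as a percentage of all codons. That reading is inferred from the numbers, not stated on this page. Under that assumption, a minimal sketch of the calculation looks like this. The function name codon_usage is hypothetical, this is not the source of CodonAaUsage.pl, each ORF is assumed to be in frame from its first base, and leftover bases at the end of an ORF are ignored. Amino acid usage is not covered because it needs a genetic code table.

```shell
# Hypothetical helper, not the source of CodonAaUsage.pl.
# Prints lines of the form: codon <TAB> triplet <TAB> percent of all codons <TAB> count
codon_usage() {
  awk '
    # Count in-frame triplets of one ORF; trailing 1 or 2 bases are ignored.
    function tally(s,    i, n) {
      n = int(length(s) / 3)
      for (i = 0; i < n; i++) {
        count[substr(s, 3 * i + 1, 3)]++
        total++
      }
    }
    /^>/ { tally(seq); seq = ""; next }        # header: finish the previous ORF
    { line = toupper($0); gsub(/[ \t\r]/, "", line); seq = seq line }
    END {
      tally(seq)
      for (c in count)
        printf "codon\t%s\t%.5f\t%d\n", c, 100 * count[c] / total, count[c]
    }
  ' "$1" | LC_ALL=C sort                       # awk array order is unspecified, so sort by codon
}
```

Running codon_usage on an <organismName>.orf.fna file gives one line per codon seen in the input, sorted alphabetically by codon.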

The analysis should be performed in a directory which has a file called <organismName>.orf.fna, and is run as follows:

basicGenomeAnalysis organismName /usr/bin/gnuplot

It is also possible to just run the calculations, without the visual presentation. This is more useful for comparing many genomes.

for i in *fna
do
    # Write the AT statistics first, then append codon and amino acid usage,
    # giving one .CodonAaUsage file per genome
    perl /usr/biotools/indirect/atStats.pl "$i" > "$i.CodonAaUsage"
    perl /usr/biotools/indirect/CodonAaUsage.pl "$i" >> "$i.CodonAaUsage"
done

To collect the data for all genomes, construct one file per type of data (amino acid usage, codon usage and statistics):

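# Amino acid usage: keep the aa lines and reduce the file name prefix to the genome name
grep aa *AaUsage > aaUsage.all
sed -i 's/_prodigal\.orf\.fna\.CodonAaUsage:aa//g' aaUsage.all

# Basic statistics: drop the file name prefix added by grep, then strip the _prodigal.orf.fna suffix from what remains
grep Total *AaUsage > statistics.all
sed -i 's/_prodigal\.orf\.fna\.CodonAaUsage:/\t/g' statistics.all
cut -f2-8 statistics.all > tmp.all
mv tmp.all statistics.all
sed -i 's/_prodigal\.orf\.fna//g' statistics.all

# Codon usage: same treatment as the amino acid lines
grep codon *AaUsage > codonUsage.all
sed -i 's/_prodigal\.orf\.fna\.CodonAaUsage:codon//g' codonUsage.all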
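
To see what the sed substitutions above do, it can help to run one on a single line. When grep searches several files it prefixes each match with the file name and a colon, so an amino acid line arrives as the first argument to printf below. The wrapper name strip_prefix is hypothetical and exists only for this illustration; the sample line is built from the example genome name used earlier on this page.

```shell
# Hypothetical wrapper around the substitution used above.
# $1 is the record type to strip together with the file suffix: aa or codon.
strip_prefix() {
  sed "s/_prodigal\.orf\.fna\.CodonAaUsage:$1//"
}

# A line as grep would emit it for an amino acid record.
printf 'Veillonella_parvula_ATCC_17745_prodigal.orf.fna.CodonAaUsage:aa\tC\t0.9828\n' | strip_prefix aa
# prints: Veillonella_parvula_ATCC_17745<TAB>C<TAB>0.9828
```

After the substitution the first column is the genome name alone, which is what makes the combined .all files convenient to load as tables.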

Questions

  • Use head and tail to have a look at these files. What do they contain?

These files can be used to plot several different patterns using Excel, R or other plotting programs. What the data can be used for depends on the aim of the study and cannot be standardized. The approach shown here compares XX of all genomes and represents the results graphically as a XX plot. To see how these plots were made, go to the R page: http://biotoolscmg.wikia.com/wiki/R#Reshape_table_to_matrix_.28heatmap.29
