BLAST matrix theory

Technical use

Local BLAST using blastall and formatdb

The NCBI BLAST webpage allows a query sequence to be compared against large databases using the BLAST algorithm. As with the proteome comparisons presented here, it is sometimes useful to compare a query sequence to a single genome or a small selection of genomes. CMG-biotools does not include a local version of the NCBI databases, but it does allow smaller databases to be constructed locally and searched. Construct a database from a FASTA file (A.fsa):

formatdb -i A.fsa -p T -t 1

Compare the sequences in a FASTA file (C.fsa) with the database (protein BLAST):

/usr/biotools/blast/blastall -F 0 -p blastp -d A.fsa -e 1e-5 -m 7 < C.fsa > qC_dA.blastout

Be aware that the complete path to blastall must be given for this to work (as shown).
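When the same two steps have to be repeated for several genomes, they can be wrapped in a small script. The following is a minimal Python sketch, not part of CMG-biotools: the function names are hypothetical, and the flags are copied unchanged from the two commands above.

```python
import subprocess

# Full path to blastall, as required in the command above.
BLASTALL = "/usr/biotools/blast/blastall"


def formatdb_command(db_fasta):
    """Argument list for building a protein database from a FASTA file."""
    return ["formatdb", "-i", db_fasta, "-p", "T", "-t", "1"]


def blastall_command(db_fasta):
    """Argument list for a protein BLAST search against that database."""
    return [BLASTALL, "-F", "0", "-p", "blastp",
            "-d", db_fasta, "-e", "1e-5", "-m", "7"]


def run_local_blast(db_fasta, query_fasta, out_path):
    """Build the database, then run the query file against it.

    The query is passed on standard input and the report is written to
    out_path, mirroring the '<' and '>' redirections in the shell command.
    """
    subprocess.run(formatdb_command(db_fasta), check=True)
    with open(query_fasta) as query, open(out_path, "w") as out:
        subprocess.run(blastall_command(db_fasta),
                       stdin=query, stdout=out, check=True)


# Example call (requires formatdb and blastall to be installed):
# run_local_blast("A.fsa", "C.fsa", "qC_dA.blastout")
```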

PostScript modifications

Bounding box

If the names of the organisms reach outside the border of the picture, change the numbers in the %%BoundingBox field of the .ps file. Each of these numbers corresponds to a coordinate:

%%BoundingBox: 0 0 2212 1110
%%BoundingBox: llx lly urx ury

The numbers define the size of the picture and are the coordinates llx lly urx ury: the lower left corner (x, y) and the upper right corner (x, y). Change the coordinates, save the .ps file and open it again.
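The same edit can be scripted instead of done by hand. The sketch below is only an illustration: set_bounding_box is a hypothetical helper, and the new value of 2400 for urx is an arbitrary example, since the right numbers depend on where the labels are cut off.

```python
import re


def set_bounding_box(ps_text, llx, lly, urx, ury):
    """Return ps_text with its first %%BoundingBox comment replaced."""
    new_line = f"%%BoundingBox: {llx} {lly} {urx} {ury}"
    updated, count = re.subn(r"^%%BoundingBox:.*$", new_line, ps_text,
                             count=1, flags=re.MULTILINE)
    if count == 0:
        raise ValueError("no %%BoundingBox comment found")
    return updated


# Example: widen the picture to the right (urx 2212 -> 2400).
example_ps = "%!PS-Adobe-3.0\n%%BoundingBox: 0 0 2212 1110\nshowpage\n"
widened_ps = set_bounding_box(example_ps, 0, 0, 2400, 1110)
```

To apply it to a real file, read the .ps file into a string, call the function and write the result back under a new name so the original is kept.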

Hash lines in matrix PostScript file

The matrix may show a line for each organism saying something like

(HASH\(0x954dba8\))  -45 rotate   dup stringwidth pop neg 0 rmoveto show  45 rotate

This can be removed by deleting the entire block for that element of the picture:

183.497474683058 ux 111.639610306789 uy moveto
(HASH\(0x954dba8\))  -45 rotate   dup stringwidth pop neg 0 rmoveto show  45 rotate 

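This can be done for all entries in the .ps file as follows (matrix.ps and matrix_clean.ps are placeholder names for the input and output files):

awk '/HASH/{c=1;next} c--<0 && p{print p} {p=$0} END{print p} ' matrix.ps > matrix_clean.ps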
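As an alternative to the awk one-liner, the removal can be written out explicitly. The sketch below is not a translation of the awk program: it assumes that each unwanted label is a single line containing HASH, preceded by one line ending in moveto (as in the block shown above), and it leaves every other line untouched. The function name is hypothetical.

```python
def strip_hash_labels(ps_text):
    """Drop each line containing 'HASH' and the moveto line just before it."""
    kept = []
    for line in ps_text.splitlines():
        if "HASH" in line:
            # Remove the positioning line that belongs to this label.
            if kept and kept[-1].rstrip().endswith("moveto"):
                kept.pop()
            continue
        kept.append(line)
    return "\n".join(kept) + "\n"


# Example using the block from the text, surrounded by two other lines.
example_lines = [
    "gsave",
    "183.497474683058 ux 111.639610306789 uy moveto",
    r"(HASH\(0x954dba8\))  -45 rotate   dup stringwidth pop neg 0 rmoveto show  45 rotate",
    "showpage",
]
cleaned = strip_hash_labels("\n".join(example_lines) + "\n")
```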

Pan- and core-genome plot

The pan- and core-genome plot method looks at the cumulative set of all genes found across the genomes (pan-genome) and the set of gene families conserved across all genomes (core-genome).

The pan- and core-genomes are theoretical representations of a collective protein pool and a conserved protein pool, respectively.

Fig. 7: Pancore

When a protein type is found in all genomes in a collection, it is called a core gene of this collection. Here this is implemented in a pan- and core-genome plot in which sequences are compared using BLAST and the 50/50 cutoff (at least 50% identity over at least 50% of the length of the longest gene in the comparison). As the clusters grow to more than two members, single linkage clustering is used to assign a new sequence to a group.

The program performing this analysis is called pancoreplot. Its input is a tab-separated text file listing a number of FASTA files containing amino acid sequences, typically all the protein FASTA files in a given directory. A protein FASTA file can be obtained by extracting proteins from a GenBank file (using saco extract) or by using the Prodigal gene finder (extract DNA from GenBank, saco convert, and find genes using prodigalrunner).

For the first genome, the pan- and core-genome are identical. The core becomes smaller with the addition of a second genome, as genes in this pool now need to be found in both genomes. If a gene from the core is not found in a new genome, it is removed from the core and is then only part of the pan-genome. The pan-genome is the entire gene pool and as such includes the core-genome. The order of the genomes can change the course of the graph, but the final shared gene pool (core) will be the same.
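The logic described above can be illustrated with a toy example. This is a simplified Python sketch and not the pancoreplot implementation: all function names and the example data are hypothetical, the cutoff is read as "at least 50%" on both criteria, and the families are built from all hits at once rather than genome by genome.

```python
from collections import defaultdict


def passes_cutoff(identity_pct, aligned_len, len_a, len_b):
    """50/50 rule: >=50% identity over >=50% of the longer sequence."""
    return identity_pct >= 50.0 and aligned_len >= 0.5 * max(len_a, len_b)


def single_linkage(proteins, hits):
    """Group proteins into families; one accepted hit to any member suffices."""
    parent = {p: p for p in proteins}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # shorten the path while walking up
            p = parent[p]
        return p

    for a, b in hits:
        parent[find(a)] = find(b)
    return {p: find(p) for p in proteins}


def pan_core_curve(genome_order, family_of):
    """Return (genome, pan size, core size) after each added genome."""
    families = defaultdict(set)
    for (genome, _protein), family in family_of.items():
        families[genome].add(family)
    pan, core, curve = set(), None, []
    for genome in genome_order:
        pan |= families[genome]
        core = set(families[genome]) if core is None else core & families[genome]
        curve.append((genome, len(pan), len(core)))
    return curve


# Toy data: proteins are (genome, protein id) pairs.
proteins = [("A", "a1"), ("A", "a2"), ("B", "b1"), ("B", "b2"), ("C", "c1")]
# Candidate BLAST hits: (protein, protein, % identity, aligned length, len 1, len 2)
candidates = [
    (("A", "a1"), ("B", "b1"), 80.0, 90, 100, 100),  # accepted
    (("A", "a2"), ("B", "b2"), 40.0, 95, 100, 100),  # identity too low
    (("B", "b1"), ("C", "c1"), 60.0, 60, 100, 110),  # accepted (60 >= 55)
]
hits = [(a, b) for a, b, ident, aln, la, lb in candidates
        if passes_cutoff(ident, aln, la, lb)]
family_of = single_linkage(proteins, hits)
curve = pan_core_curve(["A", "B", "C"], family_of)
# curve == [("A", 2, 2), ("B", 3, 1), ("C", 3, 1)]
```

Running pan_core_curve with the genomes in the order C, B, A gives a different curve but the same final pan and core sizes, which matches the remark about genome order.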

The program takes an input file with two tab-separated columns. The first column is a genome description and cannot contain spaces or tabs. The second column is the file name, if the script is run in the same directory as the file:

# Description Filename
CNRZ1066	Streptococcus_thermophilus_CNRZ1066_ID_58221_prodigal.orf.fsa
JIM_8232	Streptococcus_thermophilus_JIM_8232_ID_68521_prodigal.orf.fsa
LMD-9	Streptococcus_thermophilus_LMD-9_ID_13773_prodigal.orf.fsa
LMG_18311	Streptococcus_thermophilus_LMG_18311_ID_58219_prodigal.orf.fsa
ND03	Streptococcus_thermophilus_ND03_ID_49149_prodigal.orf.fsa

or a complete file path if the script is run in some other directory:

# Description File_path
CNRZ1066	/home/student/Strept_thermophilus_CNRZ1066_ID_58221_prodigal.orf.fsa
JIM_8232	/home/student/Strep_thermophilus_JIM_8232_ID_68521_prodigal.orf.fsa
LMD-9	/home/student/Strep_thermophilus_LMD-9_ID_13773_prodigal.orf.fsa
LMG_18311	/home/student/Strept_thermophilus_LMG_18311_ID_58219_prodigal.orf.fsa
ND03	/home/student/Strep_thermophilus_ND03_ID_49149_prodigal.orf.fsa

The input file can be made manually or using this shortcut:

ls -1 *orf.fsa | gawk '{print $1 "\t" $1}' > pancore.list
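Because the description column must not contain spaces or tabs, a hand-made list is worth checking before it is passed to pancoreplot. The following Python sketch is a hypothetical checker, not part of CMG-biotools. It assumes that lines starting with # are header or comment lines, as in the examples above, and that the two columns are separated by exactly one tab.

```python
def parse_genome_list(text):
    """Return (description, filename) pairs, raising ValueError on bad lines."""
    entries = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header/comment line
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(
                f"line {number}: expected 2 tab-separated columns, "
                f"found {len(fields)}")
        description, filename = fields
        if not description or any(ch.isspace() for ch in description):
            raise ValueError(f"line {number}: bad description {description!r}")
        if not filename or filename != filename.strip():
            raise ValueError(f"line {number}: bad file name {filename!r}")
        entries.append((description, filename))
    return entries


# Example with two valid entries (file names shortened for illustration).
example_list = "# Description Filename\nCNRZ1066\tA.fsa\nLMD-9\tB.fsa\n"
entries = parse_genome_list(example_list)
```

A line that separates the columns with spaces instead of a tab, or that pads the file name with extra spaces after the tab, is reported with its line number.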

The program writes the PostScript file to the screen if the output is not redirected using ">". The PostScript file and a table of exact values can also be found in the kept directory, in the files named "ps" and "tbl" (these are the full file names). In the commands below, pancore.ps is a placeholder name for the redirected output:

pancoreplot pancore.list -keep blastOutPut > pancore.ps
pancoreplot -keep blastOutPut pancore.list > pancore.ps

student@student-VirtualBox:~$ ls -l blastOutPut/ps
-rw-r--r-- 1 student student 5265 2012-05-11 16:28 blastOutPut/ps
student@student-VirtualBox:~$ ls -l blastOutPut/tbl 
-rw-r--r-- 1 student student 240 2012-05-11 16:28 blastOutPut/tbl


Extract subsets of core and pan genomes

See this page:,_pan-_and_core_genomes