Biotools for Comparative Microbial Genomics Wiki

Downloading genomes from NCBI

The program getgbk accesses the NCBI web page, downloads the individual GenBank files and puts them together. The resulting file contains several GenBank records pasted together, one after another, and is equivalent to the GenBank file found on the web page. getgbk takes a GPID or an NCBI accession number as its argument. Here we will use the option -p, which reads the input as a GPID. The option -a reads the input as an NCBI accession number; you will need it to make genome atlases for individual chromosomes.
The syntax of the program is shown below. Note the Unix ">" sign, which redirects the output into a file. If it is left out (getgbk -p <GPID>), the program writes the output, the GenBank file, to the screen.

getgbk -p <GPID> > <GPID>.gbk    
getgbk -a <ACC> > <ACC>.gbk

Investigate GenBank file format

First, look at the GenBank file. Open the file in a text-editor (xubuntu-biotools has a text-editor called mousepad).

mousepad <name>.gbk

The file begins with the metadata: names, publications, habitat and similar information.
The next part holds the annotations: genes and CDS (CoDing Sequences).
In this section each gene is described by its location, direction, note, and translation.

Questions:

  1. What information is found in the line marked LOCUS?
  2. How many lines are marked LOCUS and what does this number show (Hint: use Unix command grep to find the LOCUS lines in a file)?
  3. Explain the content of the recurring fields marked source, gene and CDS.
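The grep hint in question 2 can be tried on a toy file first. The sketch below is only an illustration: the two records are made up and heavily shortened (they are not real GenBank data), and the variable names are arbitrary.

```shell
# Build a toy file with two shortened, made-up records.
demo=$(mktemp)
cat > "$demo" <<'EOF'
LOCUS       TOY_CHR                 1200 bp    DNA     circular BCT 01-JAN-2000
DEFINITION  Toy chromosome.
//
LOCUS       TOY_PLASMID              300 bp    DNA     circular BCT 01-JAN-2000
DEFINITION  Toy plasmid.
//
EOF

# Print the lines that start with LOCUS, then count them with -c.
grep '^LOCUS' "$demo"
locus_count=$(grep -c '^LOCUS' "$demo")
echo "LOCUS lines: $locus_count"

rm -f "$demo"
```

Running the same two grep commands on <name>.gbk gives the lines needed for questions 1 and 2.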


Rename GenBank files from GPID numbers to organism names

To make the files easier to recognize, they will now be renamed after the organism instead of the GPID number.
From this point on, <GPID> will be replaced with <name>, which refers to the organism name the file is given.

extractname <GPID>.gbk

Note that the files are not moved; each one is copied into a new file. Delete the numbered files using the command rm.
From here on, the new files are referred to as <name>.gbk in the command syntax.

Questions

  1. What could the extractname program be doing? Where does the program get the name from?
  2. To look at the code, open it in the text editor: mousepad /usr/biotools/extractname


How to remove many empty files in a directory

When downloading files from NCBI, some ID numbers might refer to empty projects.
Instead of looking up each project first, it is faster to attempt the download and let it fail. The resulting empty files can then be deleted.
In the directory where the GenBank files are found, run one of the following commands to remove files of size zero, i.e. empty files. Adjust the -name pattern to the extension of your files.

find ./ -name "*.dat" -size 0 -exec rm {} \;

or

for file in *; do if [ ! -s "$file" ]; then rm "$file"; fi; done
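Before deleting anything in a real working directory, it can help to do a dry run. The following sketch assumes a scratch directory and two hypothetical file names ending in .gbk. It first lists the empty files with -print and only then removes them.

```shell
# Scratch directory with one empty and one non-empty file (hypothetical names).
workdir=$(mktemp -d)
: > "$workdir/12345.gbk"                        # empty: simulates a failed download
printf 'LOCUS       TOY\n' > "$workdir/67890.gbk"

# Dry run: list the empty regular files that would be removed.
find "$workdir" -type f -name '*.gbk' -size 0 -print

# Remove them once the list looks right.
find "$workdir" -type f -name '*.gbk' -size 0 -exec rm {} \;

# Show what is left in the directory.
remaining=$(ls "$workdir")
echo "$remaining"
```

Adding -type f restricts the match to regular files, so directories are never passed to rm.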

How to remove empty FASTA entries in FASTA formatted files

In some cases, a FASTA formatted file will have a header line without a corresponding sequence. These entries might cause problems for the programs used and should be removed. Here is a way to remove these zero-length entries:

for i in *.fsa
do
echo "$i"
saco_convert -I Fasta -O tab "$i" > "$i.tab"
sed -r '/\t\t\t/d' "$i.tab" > "$i.temp.tab"
saco_convert -I tab -O Fasta "$i.temp.tab" > "$i"
# remove only the temporary files made for this input file
rm "$i.tab" "$i.temp.tab"
done
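Where saco_convert is not at hand, the same clean-up can be sketched with awk alone. This is a different technique from the loop above, not a description of what saco_convert does. It assumes plain FASTA in which every header line starts with ">", and the toy records are made up.

```shell
# Toy FASTA with one header that has no sequence (made-up records).
fasta=$(mktemp)
cat > "$fasta" <<'EOF'
>seq1
ATGC
>empty_entry
>seq2
GG
CC
EOF

# Hold each header back until a non-blank sequence line follows it.
# A header that is never followed by sequence is never printed.
cleaned=$(awk '
  /^>/         { header = $0; next }   # remember the most recent header
  /^[ \t\r]*$/ { next }                # skip blank lines
               { if (header != "") { print header; header = "" }
                 print }
' "$fasta")
echo "$cleaned"

rm -f "$fasta"
```

In this toy case the output keeps seq1 and seq2 with their sequence lines and drops empty_entry.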


Rename many files in a directory

This example renames files from *.gbk.fna to *.fna:

for x in *gbk.fna; do newx=$(echo "$x" | sed 's/\.gbk\.fna$/.fna/'); mv "$x" "$newx"; done


Count number of replicons in genomes (GenBank files)

To count the number of replicons in multiple GenBank files, this approach can be used:

for x in *gbk; do grep -m 1 ORGANISM "$x"; grep -c '^LOCUS' "$x"; done

The output looks as follows:

   ORGANISM  Veillonella parvula DSM 2008
 1
   ORGANISM  Acidaminococcus intestini RyC-MR95
 1
   ORGANISM  Acidaminococcus fermentans DSM 20731
 1
   ORGANISM  Megamonas hypermegale ART12/1
 1
   ORGANISM  Veillonella parvula DSM 2008
 2
   ORGANISM  Acidaminococcus fermentans DSM 20731
 2
   ORGANISM  Selenomonas sputigena ATCC 35185
 1
   ORGANISM  Selenomonas sputigena ATCC 35185
 2
   ORGANISM  Megasphaera elsdenii DSM 20460
 1
   ORGANISM  Megasphaera elsdenii DSM 20460
 2
   ORGANISM  Acidaminococcus intestini RyC-MR95
 2
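Because the same organism name can appear for more than one file, it may be easier to read the result with the file name, organism and count on one line. The sketch below is one possible variant of the loop, run on two made-up, shortened records. The file names, organism names and tab-separated layout are assumptions for illustration.

```shell
# Scratch directory with two toy GenBank-like files (made-up content).
workdir=$(mktemp -d)
cat > "$workdir/toyA.gbk" <<'EOF'
LOCUS       TOYA_CHR
  ORGANISM  Toyus exemplaris A1
//
LOCUS       TOYA_PLASMID
  ORGANISM  Toyus exemplaris A1
//
EOF
cat > "$workdir/toyB.gbk" <<'EOF'
LOCUS       TOYB_CHR
  ORGANISM  Toyus alter B2
//
EOF

# One tab-separated line per file: file name, organism, number of LOCUS lines.
summary=$(for x in "$workdir"/*.gbk; do
    organism=$(grep -m 1 'ORGANISM' "$x" | sed 's/^ *ORGANISM *//')
    replicons=$(grep -c '^LOCUS' "$x")
    printf '%s\t%s\t%s\n' "$(basename "$x")" "$organism" "$replicons"
done)
echo "$summary"
```

To use it on real data, replace "$workdir"/*.gbk with *gbk in the directory holding the GenBank files.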


Basic genome statistics

Basic genome statistics can be calculated for multiple DNA FASTA files, containing whole genome DNA (not genes):

for x in *fna; do genomeStatistics "$x"; done


Extract DNA and proteins from GenBank files

To extract complete DNA sequences from multiple GenBank files use the following loop:

for x in *gbk; do saco_convert -I genbank -O fasta < "$x" > "$x.fna"; done

This will create files with the extension *.gbk.fna.


Rename files

Files with the wrong ending can be renamed with the following type of loop:

for x in *gbk.fna; do newx=$(echo "$x" | sed 's/\.gbk\.fna$/.fna/'); mv "$x" "$newx"; done

Or another example:

for x in *gbk.proteins.fsa; do newx=$(echo "$x" | sed 's/\.gbk\.proteins\.fsa$/.proteins.fsa/'); mv "$x" "$newx"; done
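The same kind of renaming can also be done without sed, using the shell's own suffix removal (${x%suffix}). The following is a minimal sketch on hypothetical file names in a scratch directory. It is an alternative to the loops above, not a replacement for them.

```shell
# Scratch directory with hypothetical file names.
workdir=$(mktemp -d)
touch "$workdir/Ecoli.gbk.fna" "$workdir/Bsubtilis.gbk.fna"

for x in "$workdir"/*.gbk.fna; do
    [ -e "$x" ] || continue          # skip if the pattern matched nothing
    newx="${x%.gbk.fna}.fna"         # strip the old suffix, append the new one
    mv -- "$x" "$newx"
done

# Show the resulting file names.
renamed=$(ls "$workdir")
echo "$renamed"
```

Here ${x%.gbk.fna} removes the literal suffix only at the end of the name, so a ".gbk.fna" elsewhere in the name is left alone.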


Extract proteins from GenBank files

This procedure extracts translated genes from GenBank files in cases where these have been annotated by the publisher of the genomes. To extract annotated proteins from GenBank files, use the following loop. Note that some genome projects do not have annotated proteins but might still have DNA sequence.

for x in *gbk; do saco_extract -I genbank -O fasta -t < "$x" > "$x.proteins.fsa"; done


Genefinding in DNA

Running genefinding requires a DNA FASTA file:

for x in *fna; do prodigalrunner "$x"; done


Count number of entries in FASTA

grep -c "^>" *fsa