A program called getgbk accesses the NCBI webpage, downloads the individual GenBank files, and puts them together. The resulting GenBank file contains the individual GenBank files pasted together, one after another, and is equivalent to the one found on the webpage. getgbk takes a GPID or an NCBI accession number as its argument. Here we will use the option -p, which reads the input as a GPID. The other option, -a, reads the input as an NCBI accession number; you will need it to make genome atlases for individual chromosomes.
The syntax of the program is shown below. Note the Unix usage of the ">" sign, which is a redirection of the output into a file. If this is not included (getgbk -p <GPID>), the program will write the output, the GenBank file, to the screen.
# Make sure that these new GenBank files are located in the same directory as the pre-downloaded files!
getgbk -p <GPID> > <GPID>.gbk   # NOTE: This step should only be performed on the additional-genomes.
getgbk -a <ACC> > <ACC>.gbk     # NOTE: This step should only be performed on the additional-genomes.
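The effect of ">" can be tried without contacting NCBI. The sketch below is only an illustration: fake_getgbk is a hypothetical stand-in that prints a two-line GenBank-like record, and the file name example.gbk is made up for the demonstration.

```shell
# Hypothetical stand-in for getgbk: prints a minimal GenBank-like record to standard output.
fake_getgbk() {
    printf 'LOCUS       EXAMPLE_1\n//\n'
}

# Work in a temporary directory so no real files are touched.
workdir=$(mktemp -d)

# Without redirection the record is printed to the screen.
fake_getgbk

# With ">" the same output is written to a file and nothing appears on the screen.
fake_getgbk > "$workdir/example.gbk"

# "-s" is true when the file exists and is not empty: a quick sanity check after a download.
if [ -s "$workdir/example.gbk" ]; then
    echo "example.gbk written"
fi
```

The same "-s" check can be applied to a real <GPID>.gbk file to confirm that a download did not produce an empty file.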
Investigate GenBank file format
First, look at the GenBank file. Open the file in a text editor (xubuntu-biotools has a text editor called mousepad).
The file begins with the metadata: names, publications, habitat, and similar information.
The next part is the annotations: genes and CDS (CoDing Sequences).
In this section the genes are described by their location, direction, note, and translation.
- What information is found in the line marked LOCUS?
- How many lines are marked LOCUS and what does this number show (Hint: use Unix command grep to find the LOCUS lines in a file)?
- Explain the content of the recurring fields marked source, gene and CDS.
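As a starting point for the grep hint above, the following sketch shows the two forms of the command on a tiny made-up file. The records in example.gbk are invented for illustration and are far shorter than real GenBank records; run the same commands on your own <GPID>.gbk file to answer the questions.

```shell
# Build a small two-record stand-in file (contents are illustrative only).
workdir=$(mktemp -d)
cat > "$workdir/example.gbk" <<'EOF'
LOCUS       EXAMPLE_1               1200 bp    DNA     circular BCT 01-JAN-2000
DEFINITION  Hypothetical organism chromosome.
//
LOCUS       EXAMPLE_2                300 bp    DNA     circular BCT 01-JAN-2000
DEFINITION  Hypothetical organism plasmid.
//
EOF

# Print every line that starts with LOCUS ("^" anchors the match to the start of the line).
grep '^LOCUS' "$workdir/example.gbk"

# "-c" prints the number of matching lines instead of the lines themselves.
locus_count=$(grep -c '^LOCUS' "$workdir/example.gbk")
echo "LOCUS lines: $locus_count"
```

Anchoring the pattern with "^" avoids counting lines where the word LOCUS happens to appear inside other text.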
Rename GenBank files from GPID numbers to organism names
To make the files easier to recognize, they will now be renamed after the organism instead of the GPID number.
From this point on, <GPID> will be replaced with <name> and will refer to the organism name the file is given.
extractname <GPID>.gbk   # NOTE: This step should only be performed on additional-genomes.
# Do NOT change the names of the files in the project package!
Note that the original files are not moved; their content is copied into new files.
Delete the numbered files using the command rm.
The new files will from here on be referred to as <name>.gbk in the command syntax.
- What could the extractname program be doing? Where does the program get the name from?
- To look at the code, open it in the text editor: mousepad /usr/biotools/extractname
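For comparison with what you find in the real script, here is one way such a renaming step could work. This is a guess, not the actual extractname code: it assumes the name is taken from the first ORGANISM line of the file, and the function name_from_gbk, the file 00000.gbk and its contents are all hypothetical.

```shell
# Hypothetical helper: derive a file name from the first ORGANISM line of a GenBank file.
name_from_gbk() {
    # Drop the ORGANISM keyword, replace runs of spaces with underscores, stop after the first match.
    awk '/^ *ORGANISM/ { sub(/^ *ORGANISM +/, ""); gsub(/ +/, "_"); print; exit }' "$1"
}

# Build a tiny stand-in file named after a made-up number.
workdir=$(mktemp -d)
cat > "$workdir/00000.gbk" <<'EOF'
LOCUS       EXAMPLE_1               1200 bp    DNA     circular BCT 01-JAN-2000
SOURCE      Escherichia coli
  ORGANISM  Escherichia coli
//
EOF

# Copy (not move) the numbered file to a file named after the organism.
name=$(name_from_gbk "$workdir/00000.gbk")
cp "$workdir/00000.gbk" "$workdir/$name.gbk"
ls "$workdir"
```

Because the sketch uses cp, the numbered file is still present afterwards, which matches the note above that the numbered files have to be deleted separately with rm.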