Reviewing genome annotations
The Actino_Draft database is routinely updated with new genomics data. When new genome annotations need to be imported into the database, they are processed using pdm_utils import, which reviews the quality of the annotations and how they relate to data already present in the database. The import pipeline processes GenBank-formatted flat files and checks for a variety of potential errors, including:
Changes to the genome sequence
Phage name typos
Host name typos
Missing CDS locus tags
Incorrect protein translation tables used
Missing protein translations
Incorrect field used to store the gene descriptions
Incorrect tRNA genes
Note
pdm_utils import checks the quality of flat files for the purpose of maintaining database consistency and integrity. Although it does check the quality of many aspects of the genome, it is not intended for comprehensive evaluation of the quality, validity, or biological accuracy of all data stored in the flat file.
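To make one of the checks listed above concrete, the sketch below looks for CDS features that lack a /locus_tag qualifier in the text of a GenBank-formatted flat file. It is an illustration only and not how pdm_utils implements the check: the function name cds_missing_locus_tags is hypothetical, and the parsing assumes the standard GenBank layout in which feature keys start in column 6 and locations and qualifiers start in column 22.

```python
def cds_missing_locus_tags(flat_file_text):
    """Return the locations of CDS features that have no /locus_tag qualifier.

    Hypothetical helper for illustration; assumes standard GenBank column
    layout (feature key at column 6, location/qualifiers at column 22).
    Only the first line of a multi-line location is reported.
    """
    missing = []
    in_features = False
    current = None  # the feature currently being read

    def flush(feature):
        # Record a finished CDS feature that never showed a /locus_tag.
        if feature and feature["key"] == "CDS" and not feature["has_tag"]:
            missing.append(feature["location"])

    for line in flat_file_text.splitlines():
        if line.startswith("FEATURES"):
            in_features = True
            continue
        if not in_features:
            continue
        if line and not line.startswith(" "):
            # A new top-level section (such as ORIGIN) ends the feature table.
            break
        key = line[5:21].strip()
        if key:
            # A non-empty key column marks the start of a new feature.
            flush(current)
            current = {"key": key, "location": line[21:].strip(), "has_tag": False}
        elif current and line[21:].startswith("/locus_tag="):
            current["has_tag"] = True

    flush(current)
    return missing
```

Calling the function on the contents of a flat file returns a list of CDS locations to inspect; an empty list means every CDS feature found carries a locus tag.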
After creating the GenBank-formatted flat files, annotators can follow the steps below to review their files using this pipeline and verify that they contain all the information needed to be successfully imported into the Actino_Draft database:
Ensure that MySQL is installed. If using a Mac, also ensure that the MySQL server is turned ON (installation).
Open a Terminal window.
If Conda is used to manage dependencies, activate the Conda environment. If Conda is not used, ensure all dependencies are installed (installation).
Ensure that the newest version of pdm_utils is installed (installation).
Ensure you have the most recent version of the Actino_Draft database, using get_db.
Create a folder (such as ‘validation’) to work in and navigate to it:
> mkdir validation
> cd ./validation
Within this new folder, create a csv-formatted import table (such as ‘import_table.csv’) of import tickets. For routine review of flat files to replace auto-annotated ‘draft’ genomes in the database, a simplified import table can be used consisting only of the ‘type’ and ‘phage_id’ fields. A template table is provided on the pdm_utils source code repository on GitHub. The ticket table should contain one ticket per flat file, in which:
‘type’ is set to ‘replace’.
‘phage_id’ should be changed for each flat file.
Example ticket table with 3 tickets:
type      phage_id
replace   Trixie
replace   L5
replace   D29
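When many flat files need tickets, the simplified table can be generated rather than typed by hand. The following is a minimal sketch, assuming the table starts with a header row naming the two fields as in the example above; the function name write_import_table is hypothetical and not part of pdm_utils.

```python
import csv


def write_import_table(path, phage_ids):
    """Write one 'replace' ticket per phage ID to a csv-formatted import table.

    Hypothetical helper for illustration; it assumes a header row with the
    'type' and 'phage_id' fields, matching the example ticket table.
    """
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["type", "phage_id"])
        for phage_id in phage_ids:
            writer.writerow(["replace", phage_id])


# Example: recreate the three-ticket table shown above.
write_import_table("import_table.csv", ["Trixie", "L5", "D29"])
```

Each phage_id still needs to be spelled exactly as the pipeline expects, since phage name typos are among the errors it reports.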
Create a new folder (such as ‘genomes’) within the validation folder to contain all flat files to be checked:
> mkdir genomes
Manually move all flat files into that folder. No other files should be present.
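Because stray files in the folder are easy to overlook (hidden files such as .DS_Store on a Mac, for example), a quick pre-check can help. The sketch below lists entries that do not look like GenBank flat files, relying only on the fact that such files begin with a LOCUS line. The function name unexpected_entries is hypothetical, and this check is far weaker than what the import pipeline itself performs.

```python
from pathlib import Path


def unexpected_entries(folder):
    """Return names of entries in `folder` that do not look like GenBank flat files.

    Hypothetical helper for illustration: an entry is flagged if it is not a
    regular file or if its first line does not start with 'LOCUS'.
    """
    suspects = []
    for entry in sorted(Path(folder).iterdir()):
        if not entry.is_file():
            # Subfolders and other non-file entries are unexpected here.
            suspects.append(entry.name)
            continue
        # errors="replace" lets binary files be read without raising.
        with open(entry, "r", errors="replace") as handle:
            first_line = handle.readline()
        if not first_line.startswith("LOCUS"):
            suspects.append(entry.name)
    return suspects


# Example (run from the 'validation' folder): unexpected_entries("./genomes")
```

An empty list suggests the folder contains only flat files; anything listed should be moved out before running the pipeline.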
Run import. The pipeline requires you to indicate the name of the database, the folder of flat files, the import table, and where to create the output folder. Below is an example of the command that executes the script, assuming you are still in the ‘validation’ folder:
> python3 -m pdm_utils import Actino_Draft ./genomes/ ./import_table.csv -o ./
Note
By default, the pipeline does not run in ‘production’ mode, so it does not actually make any changes to the database.
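If the review is repeated often, the same invocation can be assembled from a script. The sketch below only builds the argument list for the example command given in this step; the function name build_import_command is hypothetical, and no options beyond those in the example command are assumed.

```python
def build_import_command(database, genome_folder, import_table, output_folder):
    """Build the argument list for the example import command.

    Hypothetical helper for illustration; it mirrors the documented example
    and adds no other options.
    """
    return [
        "python3", "-m", "pdm_utils", "import",
        database,
        str(genome_folder),
        str(import_table),
        "-o", str(output_folder),
    ]


command = build_import_command(
    "Actino_Draft", "./genomes/", "./import_table.csv", "./"
)
# To run it from the 'validation' folder (the pipeline then prompts for
# MySQL credentials in the terminal):
#   import subprocess
#   subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems if a folder name contains spaces.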
When prompted, provide your MySQL username and password to access your local Actino_Draft database.
The flat files are automatically processed, generating a log file of errors.
After the evaluation is complete, review specific errors in the log file if needed.
Repeat the process if needed. Once errors are identified, re-create the flat files with the appropriate corrections and repeat the import process to ensure the corrected files now pass validation.
Once everything is correct, upload the flat file to PhagesDB for official import into the database.