import: manage genome data and annotations

In general, data pertaining to a complete phage genome is managed within a MySQL database as a discrete unit: genome data (such as the PhageID, genome sequence, and host data) is added, removed, and replaced together with all associated gene annotation data (primarily CDS features). As a result, all data pertaining to a particular phage in the database has been parsed from one external source, rather than added piecemeal from separate sources or modified within the database after insertion. A few fields are exceptions to this general practice (see update).

pdm_utils import is used to manage the addition or replacement of genomes:

> python3 -m pdm_utils import Actino_Draft ./genomes/ ./import_table.csv -o ./ -p -c config_file.txt

The first argument (‘Actino_Draft’) following ‘import’ indicates the database to be used. The next argument (‘./genomes/’) indicates the directory where the flat files are located. The next argument (‘./import_table.csv’) indicates the location of the import table. The optional ‘-o ./’ argument indicates the directory where the import results should be stored (if omitted, the default is the working directory from which the pipeline is called). As with other pipelines, the Database management configuration file option (here ‘-c config_file.txt’) can automate access to MySQL.

This tool can be used to specifically update the Actino_Draft database, manage different MySQL database instances, and support the process of genome annotation because it:

  1. relies on import tickets to substantially automate the import process.

  2. performs evaluations to verify the quality of the incoming data.

  3. provides an interactive environment for a more flexible process.

Parse and validate import table

The first step of import is to parse and prepare tickets from the import table (Import tickets). The structure of the table as well as data in each ticket is validated. For each ticket type, there are specific rules regarding how the ticket fields are populated to ensure that the ticket is implemented correctly. Additionally, the pipeline confirms that there are no duplicated tickets or tickets with conflicting data (such as an add and remove ticket for the same phage). Import tickets are automatically generated by get_data, but they can also be manually generated.
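The duplicate and conflict check described above can be sketched in a few lines of Python. This is only an illustration, not the code used by import: the column names ‘type’ and ‘phage_id’, the helper name find_ticket_conflicts, and the table contents are assumptions made for the example.

```python
import csv
import io
from collections import defaultdict


def find_ticket_conflicts(rows):
    """Return a list of problems found among ticket rows (dicts).

    Assumes each row has 'type' and 'phage_id' keys (hypothetical names).
    """
    types_by_phage = defaultdict(list)
    for row in rows:
        types_by_phage[row["phage_id"].strip()].append(row["type"].strip().lower())

    problems = []
    for phage_id, types in types_by_phage.items():
        unique_types = sorted(set(types))
        if len(types) > len(unique_types):
            # The same ticket type appears more than once for this phage.
            problems.append(f"{phage_id}: duplicated ticket")
        if len(unique_types) > 1:
            # For example, an add and a remove ticket for the same phage.
            problems.append(f"{phage_id}: conflicting ticket types {unique_types}")
    return problems


# Hypothetical import table content, used only for this example.
table = io.StringIO(
    "type,phage_id\n"
    "add,Trixie\n"
    "add,L5\n"
    "remove,L5\n"
    "add,Trixie\n"
)
rows = list(csv.DictReader(table))
problems = find_ticket_conflicts(rows)
for problem in problems:
    print(problem)
```

A table that produces any problem here would be corrected and resubmitted before any flat file is processed.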

Process flat files

After preparing import tickets from the import table, flat files (GenBank-formatted flat files) are processed one at a time, matched to the corresponding import ticket, evaluated, and implemented. For replace tickets, the current genome data in the database is removed and the data from the flat file is parsed and inserted. Two types of data are parsed from the flat file and evaluated: genome-specific and gene-specific data.

Genome-specific data

Genome-specific data, such as the phage name, nucleotide sequence, host genus, accession, and annotation authorship, are parsed and stored in the phage table. The data in the flat file is matched to the import ticket by the phage name parsed from the file. Subsequently, the data is evaluated and compared to data in the import ticket and in the database. After this, several fields in the phage table are populated from data derived from the import ticket or from the flat file.

Matching tickets to flat files requires that the phage names are spelled identically. Sometimes this is not the case: the desired spelling of the phage name in the database (and thus in the import ticket) differs slightly from the spelling in the GenBank record. These conflicts can arise for several reasons that cannot be immediately corrected (e.g. different nomenclature constraints, such as how “LeBron” is spelled “Bron” in the GenBank record).

To account for these conflicts, import contains a pre-defined phage name dictionary that converts several GenBank phage names to the desired phage name stored in the Actino_Draft database. This dictionary only contains about two dozen name conversions and does not change frequently. Alternatively, to avoid phage name discrepancies, the phage name can be parsed from different parts of the file (such as the filename itself). This allows for greater flexibility when parsing batches of flat files that may not adhere to default expectations, such as when new database instances are developed for phages that have not been annotated from disparate sources. This behavior can be selected with a command line option.
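The two name-resolution strategies above can be illustrated with a minimal sketch. It is not the code used by import: the dictionary holds only the “Bron” to “LeBron” conversion mentioned in the text, and the function name resolve_phage_name and its use_filename parameter are hypothetical.

```python
from pathlib import Path

# Hypothetical conversion table: GenBank spelling -> database spelling.
# Only the LeBron entry comes from the text; the real dictionary is not
# reproduced here.
NAME_CONVERSIONS = {"Bron": "LeBron"}


def resolve_phage_name(record_name, filepath, use_filename=False):
    """Return the phage name used to look up the import ticket.

    record_name:  phage name parsed from the GenBank record
    filepath:     path to the flat file
    use_filename: if True, take the name from the filename instead
    """
    if use_filename:
        # "genomes/LeBron.gb" -> "LeBron"
        return Path(filepath).stem
    # Fall back to the parsed name when no conversion is defined.
    return NAME_CONVERSIONS.get(record_name, record_name)


# Tickets keyed by the desired database spelling (example data).
tickets = {"LeBron": {"type": "replace"}}

name = resolve_phage_name("Bron", "genomes/record_001.gb")
print(name, tickets.get(name))
```

Taking the name from the filename avoids maintaining a conversion table, at the cost of requiring that files be named consistently.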

Gene-specific data

The second type of data parsed from the flat file pertains to individual genes (and is stored in the gene table). After parsing the genome-specific information, the annotated features are processed. The Source, tRNA, tmRNA, and CDS features are evaluated, and all others are ignored.

Note

Currently, tRNA and tmRNA features are not dynamically parsed from flat files.

CDS features are parsed, evaluated, and stored in the gene table. The majority of data that import stores in the gene table are derived directly from the flat file. Several things to note:

  1. GeneIDs represent the gene’s unique identifier in the database, and are automatically generated during import, irrespective of data from within the flat file.

  2. Gene descriptions are stored in the Notes field of the gene table. However, CDS features in flat files can contain descriptions in three different fields: PRODUCT, FUNCTION, and NOTE. The ‘description_field’ field in the import ticket indicates which of these three fields is expected to contain the gene description data.

  3. The LocusTag field in the gene table is populated directly from the LOCUS_TAG field in the CDS feature. It provides an unambiguous link to the original CDS feature in the GenBank record. This is valuable when reporting the gene information in a publication, and it is required when requesting GenBank to update information about specific CDS features (such as corrections to coordinates or gene descriptions).

  4. In many GenBank records, CDS features may contain descriptions that are not informative (e.g. “hypothetical protein”, “phage protein”, etc). These generic descriptions are not retained.
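Points 2 and 4 can be illustrated together with a short sketch. It assumes the CDS qualifiers are available as a plain dictionary with lowercase keys, and it lists only the two generic descriptions quoted above; the function name select_description and the sample CDS are hypothetical, and the logic in import may differ.

```python
# Generic descriptions quoted in the text; the list used by import is
# likely longer ("etc." in the original).
GENERIC_DESCRIPTIONS = {"hypothetical protein", "phage protein"}


def select_description(qualifiers, description_field):
    """Return the gene description to store, or '' if none is informative.

    qualifiers:        dict of CDS qualifiers, e.g. {"product": "..."}
    description_field: "product", "function", or "note" (from the ticket)
    """
    raw = qualifiers.get(description_field, "").strip()
    if raw.lower() in GENERIC_DESCRIPTIONS:
        # Uninformative descriptions are not retained.
        return ""
    return raw


# Hypothetical CDS feature with descriptions in two different fields.
cds = {"product": "hypothetical protein", "function": "lysin A"}

print(repr(select_description(cds, "product")))
print(repr(select_description(cds, "function")))
```

With the same CDS feature, the stored description depends entirely on the ‘description_field’ value in the ticket, which is why that field needs to match how the genome was annotated.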

Evaluations

For each flat file, import checks numerous fields for accuracy through a series of QC evaluations.

For some QC evaluations, an error is automatically logged when a problem is encountered. For other QC evaluations, a warning is reported when a problem is encountered, the data processing pauses, and the user is prompted to provide feedback about whether the evaluation should log a warning or an error.

Note

The prompt typically asks “Is this correct?” Replying “yes” indicates there is no true error, and no error will be logged. Replying “no” will log an error.

If a genome acquires one or more errors during import, the genome will not be imported, and no changes are made to the database for that genome. The success or failure of an import ticket has no impact on the success or failure of the next ticket. After all tickets are processed, import is completed.
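The control flow described in this section can be sketched as follows. The check descriptions, the ‘error’ and ‘warning’ labels, and the function names are assumptions made for illustration; they are not the evaluations or the API implemented in import.

```python
def evaluate(checks, ask=input):
    """Return the list of errors logged for one genome.

    checks: list of (description, passed, severity) tuples, where severity
            is "error" (logged automatically) or "warning" (user decides).
    ask:    function that shows a prompt and returns the user's reply.
    """
    errors = []
    for description, passed, severity in checks:
        if passed:
            continue
        if severity == "error":
            errors.append(description)
        else:
            reply = ask(f"Warning: {description}. Is this correct? (yes/no) ")
            # "yes" means there is no true error; anything else logs an error.
            if reply.strip().lower() not in ("yes", "y"):
                errors.append(description)
    return errors


def process_tickets(ticket_checks, ask=input):
    """Evaluate each genome independently of the others."""
    results = {}
    for phage_id, checks in ticket_checks.items():
        errors = evaluate(checks, ask)
        # A genome with one or more errors is not imported.
        results[phage_id] = "fail" if errors else "success"
    return results


# Hypothetical checks; replies are scripted here instead of typed.
ticket_checks = {
    "PhageA": [("host genus differs from ticket", False, "warning")],
    "PhageB": [("phage name missing from ticket", False, "error")],
    "PhageC": [("CDS feature lacks a locus tag", False, "warning")],
}
replies = iter(["yes", "no"])
results = process_tickets(ticket_checks, ask=lambda prompt: next(replies))
print(results)
```

Passing the prompt function as a parameter keeps the sketch testable without a terminal; in an interactive session the default, input, would be used.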

Logging database changes

Several methods of tracking and managing tickets (and the associated genomes) as they pass or fail QC are implemented:

  1. A summary of the import process is reported in the UNIX shell during import and after all tickets are processed.

  2. The results of every ticket are recorded in a log file, including any errors and warnings that were generated. Searching for “warnings” or “errors” in the file can quickly highlight the potential problems.

  3. Tickets and genome files are copied to new folders based on their ‘success’ or ‘fail’ import status. This enables quick reference to the specific tickets and genome files that need to be reviewed, modified, and repeated.

  4. import can be run under ‘test’ or ‘production’ mode. During a production run, import tickets and genome files are processed and evaluated, and the database is updated as specified by the ticket if QC is passed. In contrast, during a test run, import tickets and genome files are processed and evaluated, but the database is not updated. A test run can determine whether a particular group of tickets and flat files is ready to be imported without actually altering the database, allowing flat files to be repeatedly evaluated during the annotation process (Reviewing genome annotations).
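Points 3 and 4 can be combined in a small sketch: files are copied into a folder named after their import status, and the database is only touched during a production run. The folder layout, the function name finish_ticket, and the apply_changes callback are hypothetical and stand in for whatever import does internally.

```python
import shutil
import tempfile
from pathlib import Path


def finish_ticket(genome_file, passed_qc, output_dir, prod_run, apply_changes):
    """Copy the flat file to a status folder and update the database if allowed.

    apply_changes is a placeholder callback for the database update; it is
    only called when QC passed and this is a production run.
    """
    status = "success" if passed_qc else "fail"
    destination = Path(output_dir) / status
    destination.mkdir(parents=True, exist_ok=True)
    shutil.copy(genome_file, destination / Path(genome_file).name)
    if passed_qc and prod_run:
        apply_changes(genome_file)
    return status


# Example test run in a temporary directory with a placeholder flat file.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    flat_file = tmp / "PhageA.gb"
    flat_file.write_text("LOCUS placeholder\n")

    applied = []  # records which files would have changed the database
    status = finish_ticket(
        flat_file, True, tmp / "results", prod_run=False,
        apply_changes=applied.append,
    )
    copied = (tmp / "results" / "success" / "PhageA.gb").exists()

print(status, copied, applied)
```

In this test run the file passes QC and is sorted into the ‘success’ folder, but the update callback is never invoked, mirroring how a test run leaves the database unchanged.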