import

Primary pipeline to process and evaluate data to be imported into the MySQL database.

pdm_utils.pipelines.import_genome.check_bundle(bndl, ticket_ref='', file_ref='', retrieve_ref='', retain_ref='')

Check a Bundle for errors.

Evaluate whether all genomes have been successfully grouped, and whether all genomes have been paired, as expected. Based on the ticket type, there are expected to be certain types of genomes and pairs of genomes in the bundle.

Parameters
  • bndl – same as for run_checks().

  • ticket_ref – same as for prepare_bundle().

  • file_ref – same as for prepare_bundle().

  • retrieve_ref – same as for prepare_bundle().

  • retain_ref – same as for prepare_bundle().

pdm_utils.pipelines.import_genome.check_cds(cds_ftr, eval_flags, description_field='product')

Check a Cds object for errors.

Parameters
  • cds_ftr (Cds) – A pdm_utils Cds object.

  • eval_flags (dict) – Dictionary of boolean evaluation flags.

  • description_field (str) – Description field to check against.

pdm_utils.pipelines.import_genome.check_genome(gnm, tkt_type, eval_flags, phage_id_set={}, seq_set={}, host_genus_set={}, cluster_set={}, subcluster_set={}, accession_set={})

Check a Genome object parsed from file for errors.

Parameters
  • gnm (Genome) – A pdm_utils Genome object.

  • tkt_type (str) – ImportTicket type.

  • eval_flags (dict) – Dictionary of boolean evaluation flags.

  • phage_id_set (set) – Set of PhageIDs to check against.

  • seq_set (set) – Set of genome sequences to check against.

  • host_genus_set (set) – Set of host genera to check against.

  • cluster_set (set) – Set of clusters to check against.

  • subcluster_set (set) – Set of subclusters to check against.

  • accession_set (set) – Set of accessions to check against.

pdm_utils.pipelines.import_genome.check_retain_genome(gnm, tkt_type, eval_flags)

Check a Genome object currently in database for errors.

Parameters
  • gnm (Genome) – A pdm_utils Genome object.

  • tkt_type (str) – ImportTicket type.

  • eval_flags (dict) – Dictionary of boolean evaluation flags.

pdm_utils.pipelines.import_genome.check_source(src_ftr, eval_flags, host_genus='')

Check a Source object for errors.

Parameters
  • src_ftr (Source) – A pdm_utils Source object.

  • eval_flags (dict) – Dictionary of boolean evaluation flags.

  • host_genus (str) – Host genus to check against.

pdm_utils.pipelines.import_genome.check_ticket(tkt, type_set={}, description_field_set={}, eval_mode_set={}, id_dupe_set={}, phage_id_dupe_set={}, retain_set={}, retrieve_set={}, add_set={}, parse_set={})

Evaluate a ticket to confirm it is structured appropriately.

The assumptions about how each field is populated vary depending on the type of ticket.

Parameters
  • tkt – same as for set_cds_descriptions().

  • type_set (set) – Set of ImportTicket types to check against.

  • description_field_set (set) – Set of description fields to check against.

  • eval_mode_set (set) – Set of evaluation modes to check against.

  • id_dupe_set (set) – Set of duplicated ImportTicket ids to check against.

  • phage_id_dupe_set (set) – Set of duplicated ImportTicket PhageIDs to check against.

  • retain_set (set) – Set of retain values to check against.

  • retrieve_set (set) – Set of retrieve values to check against.

  • add_set (set) – Set of add values to check against.

  • parse_set (set) – Set of parse values to check against.

pdm_utils.pipelines.import_genome.check_tmrna(tmrna_ftr, eval_flags)

Check a Tmrna object for errors.

Parameters
  • tmrna_ftr (Tmrna) – A pdm_utils Tmrna object.

  • eval_flags (dict) – Dictionary of boolean evaluation flags.

pdm_utils.pipelines.import_genome.check_trna(trna_ftr, eval_flags)

Check a Trna object for errors.

Parameters
  • trna_ftr (Trna) – A pdm_utils Trna object.

  • eval_flags (dict) – Dictionary of boolean evaluation flags.

pdm_utils.pipelines.import_genome.compare_genomes(genome_pair, eval_flags)

Compare two genomes to identify discrepancies.

Parameters
  • genome_pair (GenomePair) – A pdm_utils GenomePair object.

  • eval_flags (dict) – Dictionary of boolean evaluation flags.

pdm_utils.pipelines.import_genome.data_io(engine=None, genome_folder=PosixPath('.'), import_table_file=PosixPath('.'), genome_id_field='', host_genus_field='', prod_run=False, description_field='', eval_mode='', output_folder=PosixPath('.'), interactive=False, accept_warning=False)

Set up output directories, log files, etc. for import.

Parameters
  • engine (Engine) – SQLAlchemy Engine object able to connect to a MySQL database.

  • genome_folder (Path) – Path to the folder of flat files.

  • import_table_file (Path) – Path to the import table file.

  • genome_id_field (str) – The SeqRecord attribute that stores the genome identifier/name.

  • host_genus_field (str) – The SeqRecord attribute that stores the host genus identifier/name.

  • prod_run (bool) – Indicates whether MySQL statements will be executed.

  • description_field (str) – The SeqFeature attribute that stores the feature’s description.

  • eval_mode (str) – Name of the evaluation mode used to evaluate genomes.

  • output_folder (Path) – Path to the folder to store results.

  • interactive (bool) – Indicates whether the user is able to interact with genome evaluations at run time.

  • accept_warning (bool) – Toggles whether the import pipeline will accept warnings without interactivity.

pdm_utils.pipelines.import_genome.get_logfile_path(bndl, paths_dict=None, filepath=None, file_ref=None)

Choose the path to output the file-specific log.

Parameters
  • bndl – same as for run_checks().

  • paths_dict (dict) – Dictionary indicating paths to success and fail folders.

  • filepath (Path) – Path to flat file.

  • file_ref – same as for prepare_bundle().

Returns

Path to log file to store flat-file-specific evaluations. If paths_dict is set to None, then None is returned instead of a path.

Return type

Path
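
The return behavior described above (a path when paths_dict is given, None otherwise) can be sketched as follows. This is only an illustration: the function name, the succeeded flag, the "success" and "fail" dictionary keys, and the ".log" file naming are all assumptions and are not taken from pdm_utils.

```python
from pathlib import Path


def choose_logfile_path(succeeded, paths_dict=None, filepath=None):
    """Hypothetical sketch: pick a log path for one flat file.

    The 'success'/'fail' keys and the '.log' suffix are assumptions.
    """
    # Mirrors the documented behavior: no folders given means no log path.
    if paths_dict is None:
        return None
    folder = paths_dict["success"] if succeeded else paths_dict["fail"]
    # Name the log after the flat file, without its extension.
    return Path(folder, Path(filepath).stem + ".log")


folders = {"success": Path("output/success"), "fail": Path("output/fail")}
print(choose_logfile_path(True, folders, Path("genomes/Trixie.gb")))
print(choose_logfile_path(True, None, Path("genomes/Trixie.gb")))
```

Callers of the real function should likewise be prepared for a None result before writing to the returned path.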

pdm_utils.pipelines.import_genome.get_mysql_reference_sets(engine)

Get multiple sets of data from the MySQL database for reference.

Parameters

engine – same as for data_io().

Returns

Dictionary of unique PhageIDs, clusters, subclusters, host genera, accessions, and sequences stored in the MySQL database.

Return type

dict
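
As a rough illustration of how a dictionary of reference sets might be used, the sketch below checks a candidate genome's values against each set. The key names ("phage_id", "accession", and so on), the helper function, and all values are placeholders chosen for this example; the actual keys returned by get_mysql_reference_sets() are not listed here.

```python
# Hypothetical reference sets; key names and values are illustrative only.
reference_sets = {
    "phage_id": {"Trixie", "L5"},
    "host_genus": {"Mycobacterium"},
    "cluster": {"A"},
    "subcluster": {"A2"},
    "accession": {"ABC12345"},
    "sequence": {"ATGC"},
}


def find_conflicts(candidate, sets):
    """Return the sorted field names whose candidate value already exists."""
    return sorted(
        field
        for field, value in candidate.items()
        if value in sets.get(field, set())
    )


new_genome = {"phage_id": "Trixie", "accession": "XYZ99999", "sequence": "GGCC"}
conflicts = find_conflicts(new_genome, reference_sets)
print(conflicts)
```

Storing each category as a set keeps every membership check constant-time on average, which matters when the sequence set holds every genome in the database.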

pdm_utils.pipelines.import_genome.get_phagesdb_reference_sets()

Get multiple sets of data from PhagesDB for reference.

Returns

Dictionary of unique clusters, subclusters, and host genera stored on PhagesDB.

Return type

dict

pdm_utils.pipelines.import_genome.get_result_string(object, attr_list)

Construct string of values from several object attributes.

Parameters
  • object (misc) – An object from which to retrieve values.

  • attr_list (list) – List of strings indicating attributes to retrieve from the object.

Returns

A concatenated string representing values from all attributes.

Return type

str
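
The idea of building one string from several attributes can be sketched with getattr(). This is a minimal stand-in, not the pdm_utils implementation: the "name: value" format, the separator, and the sample object are assumptions made for the example.

```python
from types import SimpleNamespace


def result_string_sketch(obj, attr_list, sep=", "):
    """Hypothetical sketch: join 'name: value' pairs for the listed attributes."""
    parts = []
    for attr in attr_list:
        # A default of None keeps the sketch from raising on a missing attribute.
        value = getattr(obj, attr, None)
        parts.append(f"{attr}: {value}")
    return sep.join(parts)


# Stand-in object; a real call would pass a pdm_utils object instead.
feature = SimpleNamespace(id="Trixie_CDS_1", start=100, stop=400)
summary = result_string_sketch(feature, ["id", "start", "stop"])
print(summary)
```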

pdm_utils.pipelines.import_genome.import_into_db(bndl, engine=None, gnm_key='', prod_run=False)

Import data into the MySQL database.

Parameters
  • bndl – same as for run_checks().

  • engine – same as for data_io().

  • gnm_key (str) – Identifier for the Genome object in the Bundle’s genome dictionary.

  • prod_run – same as for data_io().

pdm_utils.pipelines.import_genome.log_and_print(msg, terminal=False)

Print message to terminal in addition to logger if needed.

Parameters
  • msg (str) – Message to print.

  • terminal (bool) – Indicates if message should be printed to terminal.

pdm_utils.pipelines.import_genome.log_evaluations(dict_of_dict_of_lists, logfile_path=None)

Export evaluations to log.

Parameters
  • dict_of_dict_of_lists (dict) – Dictionary of evaluation dictionaries. Key1 = Bundle ID. Value1 = dictionary for each object in the Bundle. Key2 = object type (‘bundle’, ‘ticket’, etc.). Value2 = list of evaluation objects.

  • logfile_path (Path) – Path to the log file.
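
The nested structure described for dict_of_dict_of_lists can be hard to picture, so here is a small sketch that builds one and walks it. The stand-in evaluation objects, their status and result attribute names, the status values other than 'warning', and the output line format are all assumptions for illustration.

```python
from types import SimpleNamespace


def make_eval(status, result):
    """Stand-in for an Evaluation object; attribute names are assumed."""
    return SimpleNamespace(status=status, result=result)


# Key1 = Bundle ID, Key2 = object type, Value2 = list of evaluations.
evaluations = {
    1: {
        "bundle": [make_eval("correct", "Genomes grouped as expected.")],
        "ticket": [make_eval("warning", "Host genus differs from flat file.")],
    },
    2: {
        "bundle": [make_eval("error", "Expected genome pair is missing.")],
    },
}


def flatten_evaluations(dict_of_dict_of_lists):
    """Walk all three levels and return one text line per evaluation."""
    lines = []
    for bundle_id, object_dict in dict_of_dict_of_lists.items():
        for object_type, eval_list in object_dict.items():
            for evl in eval_list:
                lines.append(
                    f"bundle {bundle_id} | {object_type} | "
                    f"{evl.status} | {evl.result}"
                )
    return lines


log_lines = flatten_evaluations(evaluations)
print("\n".join(log_lines))
```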

pdm_utils.pipelines.import_genome.main(unparsed_args_list)

Runs the complete import pipeline.

This is the only function of the pipeline that requires user input. All other functions can be implemented from other scripts.

Parameters

unparsed_args_list (list) – List of strings representing command line arguments.

pdm_utils.pipelines.import_genome.parse_args(unparsed_args_list)

Verify the correct arguments are selected for importing new genomes.

Parameters

unparsed_args_list (list) – List of strings representing command line arguments.

Returns

ArgumentParser Namespace object containing the parsed args.

Return type

Namespace

pdm_utils.pipelines.import_genome.prepare_bundle(filepath=PosixPath('.'), ticket_dict={}, engine=None, genome_id_field='', host_genus_field='', id=None, file_ref='', ticket_ref='', retrieve_ref='', retain_ref='', id_conversion_dict={}, interactive=False)

Gather all genomic data needed to evaluate the flat file.

Parameters
  • filepath (Path) – Name of a GenBank-formatted flat file.

  • ticket_dict (dict) – A dictionary of pdm_utils ImportTicket objects.

  • engine – same as for data_io().

  • genome_id_field – same as for data_io().

  • host_genus_field – same as for data_io().

  • id (int) – Identifier to be assigned to the Bundle object.

  • file_ref (str) – Identifier for Genome objects derived from flat files.

  • ticket_ref (str) – Identifier for Genome objects derived from ImportTickets.

  • retrieve_ref (str) – Identifier for Genome objects derived from PhagesDB.

  • retain_ref (str) – Identifier for Genome objects derived from MySQL.

  • id_conversion_dict (dict) – Dictionary of PhageID conversions.

  • interactive – same as for data_io().

Returns

A pdm_utils Bundle object containing all data required to evaluate a flat file.

Return type

Bundle

pdm_utils.pipelines.import_genome.prepare_tickets(import_table_file=PosixPath('.'), eval_data_dict=None, description_field='', table_structure_dict={})

Prepare dictionary of pdm_utils ImportTickets.

Parameters
  • import_table_file – same as for data_io().

  • description_field – same as for data_io().

  • eval_data_dict (dict) – Evaluation data dictionary. Key1 = “eval_mode”, Value1 = name of the eval_mode. Key2 = “eval_flag_dict”, Value2 = dictionary of evaluation flags.

  • table_structure_dict (dict) – Dictionary describing structure of the import table.

Returns

Dictionary of pdm_utils ImportTicket objects. If a problem was encountered parsing the import table, None is returned.

Return type

dict
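
A small sketch of the eval_data_dict layout described in the parameters above may help. The two top-level keys come from that description; the mode name, the flag names, and the helper function are hypothetical.

```python
# The two top-level keys follow the parameter description; the mode name
# and flag names below are hypothetical placeholders.
eval_data_dict = {
    "eval_mode": "custom_mode",
    "eval_flag_dict": {
        "check_example_a": True,
        "check_example_b": False,
    },
}


def enabled_flags(eval_data):
    """Return the sorted names of flags that are switched on."""
    return sorted(
        name for name, is_on in eval_data["eval_flag_dict"].items() if is_on
    )


print(eval_data_dict["eval_mode"], enabled_flags(eval_data_dict))
```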

pdm_utils.pipelines.import_genome.process_files_and_tickets(ticket_dict, files_in_folder, engine=None, prod_run=False, genome_id_field='', host_genus_field='', interactive=False, log_folder_paths_dict=None, accept_warning=False)

Process GenBank-formatted flat files and import tickets.

Parameters
  • ticket_dict (dict) – A dictionary where key (str) = the ticket’s phage_id and value (Ticket) = the ticket.

  • files_in_folder (list) – A list of filepaths to be parsed.

  • engine – same as for data_io().

  • prod_run – same as for data_io().

  • genome_id_field – same as for data_io().

  • host_genus_field – same as for data_io().

  • interactive – same as for data_io().

  • accept_warning – same as for data_io().

  • log_folder_paths_dict (dict) – Dictionary indicating paths to success and fail folders.

Returns

Tuple of five objects, where:
  • [0] success_ticket_list (list) – List of successful ImportTickets.

  • [1] failed_ticket_list (list) – List of failed ImportTickets.

  • [2] success_filepath_list (list) – List of successfully parsed flat files.

  • [3] failed_filepath_list (list) – List of unsuccessfully parsed flat files.

  • [4] evaluation_dict (dict) – Dictionary from each Bundle, containing dictionaries for each bundled object, containing lists of evaluation objects.

Return type

tuple
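
Because the return value is a positional five-item tuple, unpacking it into named variables right away tends to be less error-prone than indexing. The sketch below does this on a hand-built tuple; the summarize_import helper and the sample values are hypothetical, and no pdm_utils call is made.

```python
def summarize_import(results):
    """Hypothetical helper: count the items in each part of the tuple."""
    # Order follows the documented return value, positions [0] through [4].
    (success_tickets, failed_tickets,
     success_files, failed_files, evaluation_dict) = results
    return {
        "tickets_ok": len(success_tickets),
        "tickets_failed": len(failed_tickets),
        "files_ok": len(success_files),
        "files_failed": len(failed_files),
        "bundles_evaluated": len(evaluation_dict),
    }


# Hand-built stand-in for the tuple; strings replace real ticket objects.
fake_results = (["tkt1"], [], ["Trixie.gb"], ["L5.gb"], {1: {}, 2: {}})
summary = summarize_import(fake_results)
print(summary)
```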

pdm_utils.pipelines.import_genome.review_bundled_objects(bndl, interactive=False, accept_warning=False)

Review all evaluations of all bundled objects.

Iterate through all objects stored in the bundle. If there are warnings, review whether status should be changed.

Parameters
  • bndl – same as for run_checks().

  • interactive – same as for data_io().

  • accept_warning – same as for data_io().

pdm_utils.pipelines.import_genome.review_cds_descriptions(feature_list, description_field)

Iterate through all CDS features and review descriptions.

Parameters
  • feature_list (list) – A list of pdm_utils Cds objects.

  • description_field – same as for data_io().

Returns

Name of the primary description_field after review.

Return type

str

pdm_utils.pipelines.import_genome.review_evaluation(evl, interactive=False, accept_warning=False)

Review an evaluation object.

Parameters
  • evl (Evaluation) – A pdm_utils Evaluation object.

  • interactive – same as for data_io().

  • accept_warning – same as for data_io().

Returns

tuple (exit, message) WHERE exit (bool) indicates whether the user selected to exit the review process. correct (bool) indicates whether the evaluation status is accurate.

Return type

tuple

pdm_utils.pipelines.import_genome.review_evaluation_list(evaluation_list, interactive=False, accept_warning=False)

Iterate through all evaluations and review ‘warning’ results.

Parameters
  • evaluation_list (list) – List of pdm_utils Evaluation objects.

  • interactive – same as for data_io().

  • accept_warning – same as for data_io().

Returns

Indicates whether user selected to exit the review process.

Return type

bool

pdm_utils.pipelines.import_genome.review_object_list(object_list, object_type, attr_list, interactive=False, accept_warning=False)

Determine if evaluations are present and record results.

Parameters
  • object_list (list) – List of pdm_utils objects containing evaluations.

  • object_type (str) – Name of the pdm_utils object.

  • attr_list (list) – List of attributes used to log data about the object instance.

  • interactive – same as for data_io().

  • accept_warning – same as for data_io().

pdm_utils.pipelines.import_genome.run_checks(bndl, accession_set={}, phage_id_set={}, seq_set={}, host_genus_set={}, cluster_set={}, subcluster_set={}, file_ref='', ticket_ref='', retrieve_ref='', retain_ref='')

Run checks on the different types of data in a Bundle object.

Parameters
  • bndl (Bundle) – A pdm_utils Bundle object containing bundled data.

  • accession_set (set) – Set of accessions to check against.

  • phage_id_set (set) – Set of PhageIDs to check against.

  • seq_set (set) – Set of nucleotide sequences to check against.

  • host_genus_set (set) – Set of host genera to check against.

  • cluster_set (set) – Set of Clusters to check against.

  • subcluster_set (set) – Set of Subclusters to check against.

  • file_ref – same as for prepare_bundle().

  • ticket_ref – same as for prepare_bundle().

  • retrieve_ref – same as for prepare_bundle().

  • retain_ref – same as for prepare_bundle().

pdm_utils.pipelines.import_genome.set_cds_descriptions(gnm, tkt, interactive=False)

Set the primary CDS descriptions.

Parameters
  • gnm (Genome) – A pdm_utils Genome object.

  • tkt (ImportTicket) – A pdm_utils ImportTicket object.

  • interactive – same as for data_io().