import
Primary pipeline to process and evaluate data to be imported into the MySQL database.
- pdm_utils.pipelines.import_genome.check_bundle(bndl, ticket_ref='', file_ref='', retrieve_ref='', retain_ref='')
Check a Bundle for errors.
Evaluate whether all genomes have been successfully grouped, and whether all genomes have been paired, as expected. Based on the ticket type, there are expected to be certain types of genomes and pairs of genomes in the bundle.
- Parameters
bndl – same as for run_checks().
ticket_ref – same as for prepare_bundle().
file_ref – same as for prepare_bundle().
retrieve_ref – same as for prepare_bundle().
retain_ref – same as for prepare_bundle().
- pdm_utils.pipelines.import_genome.check_cds(cds_ftr, eval_flags, description_field='product')
Check a Cds object for errors.
- Parameters
cds_ftr (Cds) – A pdm_utils Cds object.
eval_flags (dict) – Dictionary of boolean evaluation flags.
description_field (str) – Description field to check against.
- pdm_utils.pipelines.import_genome.check_genome(gnm, tkt_type, eval_flags, phage_id_set={}, seq_set={}, host_genus_set={}, cluster_set={}, subcluster_set={}, accession_set={})
Check a Genome object parsed from file for errors.
- Parameters
gnm (Genome) – A pdm_utils Genome object.
tkt_type (str) – ImportTicket type.
eval_flags (dict) – Dictionary of boolean evaluation flags.
phage_id_set (set) – Set of PhageIDs to check against.
seq_set (set) – Set of genome sequences to check against.
host_genus_set (set) – Set of host genera to check against.
cluster_set (set) – Set of clusters to check against.
subcluster_set (set) – Set of subclusters to check against.
accession_set (set) – Set of accessions to check against.
- pdm_utils.pipelines.import_genome.check_retain_genome(gnm, tkt_type, eval_flags)
Check a Genome object currently in database for errors.
- Parameters
gnm (Genome) – A pdm_utils Genome object.
tkt_type (str) – ImportTicket type.
eval_flags (dict) – Dictionary of boolean evaluation flags.
- pdm_utils.pipelines.import_genome.check_source(src_ftr, eval_flags, host_genus='')
Check a Source object for errors.
- Parameters
src_ftr (Source) – A pdm_utils Source object.
eval_flags (dict) – Dictionary of boolean evaluation flags.
host_genus (str) – Host genus to check against.
- pdm_utils.pipelines.import_genome.check_ticket(tkt, type_set={}, description_field_set={}, eval_mode_set={}, id_dupe_set={}, phage_id_dupe_set={}, retain_set={}, retrieve_set={}, add_set={}, parse_set={})
Evaluate a ticket to confirm it is structured appropriately.
The assumptions for how each field is populated vary depending on the type of ticket.
- Parameters
tkt – same as for set_cds_descriptions().
type_set (set) – Set of ImportTicket types to check against.
description_field_set (set) – Set of description fields to check against.
eval_mode_set (set) – Set of evaluation modes to check against.
id_dupe_set (set) – Set of duplicated ImportTicket ids to check against.
phage_id_dupe_set (set) – Set of duplicated ImportTicket PhageIDs to check against.
retain_set (set) – Set of retain values to check against.
retrieve_set (set) – Set of retrieve values to check against.
add_set (set) – Set of add values to check against.
parse_set (set) – Set of parse values to check against.
- pdm_utils.pipelines.import_genome.check_tmrna(tmrna_ftr, eval_flags)
Check a Tmrna object for errors.
- Parameters
tmrna_ftr (Tmrna) – A pdm_utils Tmrna object.
eval_flags (dict) – Dictionary of boolean evaluation flags.
- pdm_utils.pipelines.import_genome.check_trna(trna_ftr, eval_flags)
Check a Trna object for errors.
- Parameters
trna_ftr (Trna) – A pdm_utils Trna object.
eval_flags (dict) – Dictionary of boolean evaluation flags.
- pdm_utils.pipelines.import_genome.compare_genomes(genome_pair, eval_flags)
Compare two genomes to identify discrepancies.
- Parameters
genome_pair (GenomePair) – A pdm_utils GenomePair object.
eval_flags (dict) – Dictionary of boolean evaluation flags.
- pdm_utils.pipelines.import_genome.data_io(engine=None, genome_folder=PosixPath('.'), import_table_file=PosixPath('.'), genome_id_field='', host_genus_field='', prod_run=False, description_field='', eval_mode='', output_folder=PosixPath('.'), interactive=False, accept_warning=False)
Set up output directories, log files, etc. for import.
- Parameters
engine (Engine) – SQLAlchemy Engine object able to connect to a MySQL database.
genome_folder (Path) – Path to the folder of flat files.
import_table_file (Path) – Path to the import table file.
genome_id_field (str) – The SeqRecord attribute that stores the genome identifier/name.
host_genus_field (str) – The SeqRecord attribute that stores the host genus identifier/name.
prod_run (bool) – Indicates whether MySQL statements will be executed.
description_field (str) – The SeqFeature attribute that stores the feature’s description.
eval_mode (str) – Name of the evaluation mode used to evaluate genomes.
output_folder (Path) – Path to the folder to store results.
interactive (bool) – Indicates whether the user is able to interact with genome evaluations at run time.
accept_warning (bool) – Toggles whether the import pipeline will accept warnings without interactivity.
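As a rough illustration of how these parameters fit together, the sketch below only assembles a keyword-argument dictionary that mirrors the documented signature. It does not import or call pdm_utils. The folder names and the values `example_id_field`, `example_host_field`, and `example_mode` are placeholders invented for the example, not values confirmed by this page.

```python
from pathlib import Path

# Hypothetical location of the import run; substitute real paths before use.
base = Path("import_run")

data_io_kwargs = {
    "engine": None,  # a SQLAlchemy Engine would normally be supplied here
    "genome_folder": base / "genomes",
    "import_table_file": base / "import_table.csv",
    "genome_id_field": "example_id_field",      # placeholder value
    "host_genus_field": "example_host_field",   # placeholder value
    "prod_run": False,  # False: MySQL statements are not executed
    "description_field": "product",
    "eval_mode": "example_mode",                # placeholder value
    "output_folder": base / "results",
    "interactive": False,
    "accept_warning": False,
}

# With pdm_utils installed and an Engine available, the call would be:
# import_genome.data_io(**data_io_kwargs)
for name, value in data_io_kwargs.items():
    print(f"{name} = {value!r}")
```

Keeping the arguments in one dictionary makes it easy to switch `prod_run` on only after a dry run has been reviewed.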
- pdm_utils.pipelines.import_genome.get_logfile_path(bndl, paths_dict=None, filepath=None, file_ref=None)
Choose the path to output the file-specific log.
- Parameters
bndl – same as for run_checks().
paths_dict (dict) – Dictionary indicating paths to success and fail folders.
filepath (Path) – Path to flat file.
file_ref – same as for prepare_bundle().
- Returns
Path to log file to store flat-file-specific evaluations. If paths_dict is set to None, then None is returned instead of a path.
- Return type
Path
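The selection described above could look roughly like the following stand-in. It is not the pdm_utils implementation: the real function inspects a Bundle, whereas this sketch takes a plain boolean, and the dictionary keys `success` and `fail` and the `.log` suffix are assumptions made for illustration.

```python
from pathlib import Path

def choose_logfile_path(passed, paths_dict=None, filepath=None):
    """Hypothetical stand-in: pick a log path in the success or fail folder."""
    # Mirrors the documented behaviour: no paths_dict means no log path.
    if paths_dict is None:
        return None
    # Assumed key names; the real dictionary layout is not shown on this page.
    folder = paths_dict["success"] if passed else paths_dict["fail"]
    return Path(folder, filepath.stem + ".log")

folders = {"success": Path("passed_logs"), "fail": Path("failed_logs")}
print(choose_logfile_path(True, folders, Path("genomes/phage1.gb")))
print(choose_logfile_path(False, folders, Path("genomes/phage1.gb")))
print(choose_logfile_path(True))
```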
- pdm_utils.pipelines.import_genome.get_mysql_reference_sets(engine)
Get multiple sets of data from the MySQL database for reference.
- Parameters
engine – same as for data_io().
- Returns
Dictionary of unique PhageIDs, clusters, subclusters, host genera, accessions, and sequences stored in the MySQL database.
- Return type
dict
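To make the shape of such a reference dictionary concrete, here is a minimal sketch of a dictionary of sets and a membership check against it. The key names and the sample values are assumptions for the example; this page does not list the actual keys returned.

```python
# Hypothetical reference data; key names are assumed for this sketch.
reference_sets = {
    "phage_id_set": {"PhageA", "PhageB"},
    "host_genus_set": {"Mycobacterium"},
    "cluster_set": {"A", "B"},
}

def is_new_phage_id(phage_id, ref):
    """Return True if phage_id is absent from the reference PhageIDs."""
    return phage_id not in ref["phage_id_set"]

print(is_new_phage_id("PhageC", reference_sets))  # True
print(is_new_phage_id("PhageA", reference_sets))  # False
```

Sets give constant-time membership tests, which suits the repeated "check against" lookups the check functions perform.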
- pdm_utils.pipelines.import_genome.get_phagesdb_reference_sets()
Get multiple sets of data from PhagesDB for reference.
- Returns
Dictionary of unique clusters, subclusters, and host genera stored on PhagesDB.
- Return type
dict
- pdm_utils.pipelines.import_genome.get_result_string(object, attr_list)
Construct string of values from several object attributes.
- Parameters
object (misc) – An object from which to retrieve values.
attr_list (list) – List of strings indicating attributes to retrieve from the object.
- Returns
A concatenated string representing values from all attributes.
- Return type
str
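The idea of building one string from several attributes can be sketched with `getattr`. The function name `build_result_string`, the `name: value` layout, and the comma separator are assumptions; the exact format produced by pdm_utils is not specified here.

```python
from types import SimpleNamespace

def build_result_string(obj, attr_list, sep=", "):
    """Hypothetical stand-in: join 'name: value' pairs for the listed attributes."""
    parts = [f"{attr}: {getattr(obj, attr)}" for attr in attr_list]
    return sep.join(parts)

# A throwaway object standing in for a pdm_utils feature.
feature = SimpleNamespace(id="cds_1", start=10, stop=250)
print(build_result_string(feature, ["id", "start"]))  # id: cds_1, start: 10
```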
- pdm_utils.pipelines.import_genome.import_into_db(bndl, engine=None, gnm_key='', prod_run=False)
Import data into the MySQL database.
- Parameters
bndl – same as for run_checks().
engine – same as for data_io().
gnm_key (str) – Identifier for the Genome object in the Bundle’s genome dictionary.
prod_run – same as for data_io().
- pdm_utils.pipelines.import_genome.log_and_print(msg, terminal=False)
Print message to terminal in addition to logger if needed.
- Parameters
msg (str) – Message to print.
terminal (bool) – Indicates if message should be printed to terminal.
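A helper of this kind can be approximated with the standard `logging` module. This is a sketch under assumptions, not the pdm_utils source: the logger name `import_sketch` and the choice of the INFO level are invented for the example.

```python
import logging

# Assumed logger name; handlers and levels would be configured elsewhere.
logger = logging.getLogger("import_sketch")

def log_and_print_sketch(msg, terminal=False):
    """Always send msg to the logger; also print it when terminal is True."""
    logger.info(msg)
    if terminal:
        print(msg)

log_and_print_sketch("Recorded in the log only.")
log_and_print_sketch("Recorded in the log and shown to the user.", terminal=True)
```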
- pdm_utils.pipelines.import_genome.log_evaluations(dict_of_dict_of_lists, logfile_path=None)
Export evaluations to log.
- Parameters
dict_of_dict_of_lists (dict) – Dictionary of evaluation dictionaries. Key1 = Bundle ID. Value1 = dictionary for each object in the Bundle. Key2 = object type (‘bundle’, ‘ticket’, etc.). Value2 = list of evaluation objects.
logfile_path (Path) – Path to the log file.
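The nested structure described for dict_of_dict_of_lists is easier to see written out. In the sketch below, `Eval` is a hypothetical stand-in for a pdm_utils Evaluation object, and its field names, the sample ids, and the tab-separated output are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class Eval:
    """Hypothetical stand-in for a pdm_utils Evaluation object."""
    id: str
    status: str
    result: str

evaluations = {
    1: {  # Key1: Bundle ID
        # Key2: object type, Value2: list of evaluation objects
        "bundle": [Eval("EX_001", "correct", "Genomes are paired.")],
        "ticket": [Eval("EX_002", "error", "Unexpected ticket type.")],
    },
}

def flatten(dict_of_dict_of_lists):
    """Walk bundle -> object type -> evaluations and build one line per evaluation."""
    lines = []
    for bundle_id, object_dict in dict_of_dict_of_lists.items():
        for object_type, eval_list in object_dict.items():
            for evl in eval_list:
                lines.append(
                    f"{bundle_id}\t{object_type}\t{evl.id}\t{evl.status}\t{evl.result}"
                )
    return lines

for line in flatten(evaluations):
    print(line)
```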
- pdm_utils.pipelines.import_genome.main(unparsed_args_list)
Runs the complete import pipeline.
This is the only function of the pipeline that requires user input. All other functions can be implemented from other scripts.
- Parameters
unparsed_args_list (list) – List of strings representing command line arguments.
- pdm_utils.pipelines.import_genome.parse_args(unparsed_args_list)
Verify that the correct arguments are selected for importing new genomes.
- Parameters
unparsed_args_list (list) – List of strings representing command line arguments.
- Returns
ArgumentParser Namespace object containing the parsed args.
- Return type
Namespace
- pdm_utils.pipelines.import_genome.prepare_bundle(filepath=PosixPath('.'), ticket_dict={}, engine=None, genome_id_field='', host_genus_field='', id=None, file_ref='', ticket_ref='', retrieve_ref='', retain_ref='', id_conversion_dict={}, interactive=False)
Gather all genomic data needed to evaluate the flat file.
- Parameters
filepath (Path) – Name of a GenBank-formatted flat file.
ticket_dict (dict) – A dictionary of pdm_utils ImportTicket objects.
engine – same as for data_io().
genome_id_field – same as for data_io().
host_genus_field – same as for data_io().
id (int) – Identifier to be assigned to the Bundle object.
file_ref (str) – Identifier for Genome objects derived from flat files.
ticket_ref (str) – Identifier for Genome objects derived from ImportTickets.
retrieve_ref (str) – Identifier for Genome objects derived from PhagesDB.
retain_ref (str) – Identifier for Genome objects derived from MySQL.
id_conversion_dict (dict) – Dictionary of PhageID conversions.
interactive – same as for data_io().
- Returns
A pdm_utils Bundle object containing all data required to evaluate a flat file.
- Return type
Bundle
- pdm_utils.pipelines.import_genome.prepare_tickets(import_table_file=PosixPath('.'), eval_data_dict=None, description_field='', table_structure_dict={})
Prepare dictionary of pdm_utils ImportTickets.
- Parameters
import_table_file – same as for data_io().
description_field – same as for data_io().
eval_data_dict (dict) – Evaluation data dictionary. Key1 = “eval_mode”, Value1 = name of the eval_mode. Key2 = “eval_flag_dict”, Value2 = dictionary of evaluation flags.
table_structure_dict (dict) – Dictionary describing structure of the import table.
- Returns
Dictionary of pdm_utils ImportTicket objects. If a problem was encountered parsing the import table, None is returned.
- Return type
dict
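The two-key layout of eval_data_dict can be written out as a literal. The keys `eval_mode` and `eval_flag_dict` come from the parameter description above; the mode name and the flag names below are placeholders, since the real ones are not listed on this page.

```python
eval_data_dict = {
    "eval_mode": "example_mode",      # placeholder mode name
    "eval_flag_dict": {               # placeholder flag names
        "check_example_a": True,
        "check_example_b": False,
    },
}

# List the flags that are switched on for this mode.
enabled = sorted(
    flag for flag, is_on in eval_data_dict["eval_flag_dict"].items() if is_on
)
print(eval_data_dict["eval_mode"], enabled)  # example_mode ['check_example_a']
```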
- pdm_utils.pipelines.import_genome.process_files_and_tickets(ticket_dict, files_in_folder, engine=None, prod_run=False, genome_id_field='', host_genus_field='', interactive=False, log_folder_paths_dict=None, accept_warning=False)
Process GenBank-formatted flat files and import tickets.
- Parameters
ticket_dict (dict) – A dictionary where each key (str) is a ticket’s phage_id and each value (Ticket) is the corresponding ticket.
files_in_folder (list) – A list of filepaths to be parsed.
engine – same as for data_io().
prod_run – same as for data_io().
genome_id_field – same as for data_io().
host_genus_field – same as for data_io().
interactive – same as for data_io().
accept_warning – same as for data_io().
log_folder_paths_dict (dict) – Dictionary indicating paths to success and fail folders.
- Returns
Tuple of five objects:
[0] success_ticket_list (list): list of successful ImportTickets.
[1] failed_ticket_list (list): list of failed ImportTickets.
[2] success_filepath_list (list): list of successfully parsed flat files.
[3] failed_filepath_list (list): list of unsuccessfully parsed flat files.
[4] evaluation_dict (dict): dictionary from each Bundle, containing dictionaries for each bundled object, containing lists of evaluation objects.
- Return type
tuple
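Because the return value is positional, unpacking it into named variables keeps calling code readable. The function `fake_process_results` below is a hypothetical stand-in that returns dummy data in the documented order; it does not parse any files.

```python
def fake_process_results():
    """Hypothetical stand-in returning the documented five-element tuple."""
    return (
        ["ticket_ok"],           # [0] successful tickets
        ["ticket_bad"],          # [1] failed tickets
        ["ok.gb"],               # [2] successfully parsed flat files
        ["bad.gb"],              # [3] unsuccessfully parsed flat files
        {1: {"ticket": []}},     # [4] evaluations per Bundle and object type
    )

(success_tickets, failed_tickets,
 success_files, failed_files, evaluation_dict) = fake_process_results()

summary = {
    "tickets": (len(success_tickets), len(failed_tickets)),
    "files": (len(success_files), len(failed_files)),
    "bundles_evaluated": len(evaluation_dict),
}
print(summary)
```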
- pdm_utils.pipelines.import_genome.review_bundled_objects(bndl, interactive=False, accept_warning=False)
Review all evaluations of all bundled objects.
Iterate through all objects stored in the bundle. If there are warnings, review whether status should be changed.
- Parameters
bndl – same as for run_checks().
interactive – same as for data_io().
accept_warning – same as for data_io().
- pdm_utils.pipelines.import_genome.review_cds_descriptions(feature_list, description_field)
Iterate through all CDS features and review descriptions.
- Parameters
feature_list (list) – A list of pdm_utils Cds objects.
description_field – same as for data_io().
- Returns
Name of the primary description_field after review.
- Return type
str
- pdm_utils.pipelines.import_genome.review_evaluation(evl, interactive=False, accept_warning=False)
Review an evaluation object.
- Parameters
evl (Evaluation) – A pdm_utils Evaluation object.
interactive – same as for data_io().
accept_warning – same as for data_io().
- Returns
Tuple (exit, correct), where exit (bool) indicates whether the user chose to exit the review process and correct (bool) indicates whether the evaluation status is accurate.
- Return type
tuple
- pdm_utils.pipelines.import_genome.review_evaluation_list(evaluation_list, interactive=False, accept_warning=False)
Iterate through all evaluations and review ‘warning’ results.
- Parameters
evaluation_list (list) – List of pdm_utils Evaluation objects.
interactive – same as for data_io().
accept_warning – same as for data_io().
- Returns
Indicates whether user selected to exit the review process.
- Return type
bool
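The control flow described here (visit each evaluation, review only the warnings, stop when the reviewer asks to exit) might be sketched as follows. Everything in the block is a stand-in: `Eval`, `review_list_sketch`, and the injected `review` callable are hypothetical, and the sketch deliberately ignores what the real pipeline does with the second element of the reviewer's result.

```python
from dataclasses import dataclass

@dataclass
class Eval:
    """Hypothetical stand-in for a pdm_utils Evaluation object."""
    status: str

def review_list_sketch(evaluation_list, review):
    """Review each 'warning' evaluation; return True if the reviewer exits."""
    for evl in evaluation_list:
        if evl.status == "warning":
            # The reviewer is assumed to return a two-element tuple whose
            # first element signals a request to stop reviewing.
            exit_requested, _second = review(evl)
            if exit_requested:
                return True
    return False

evals = [Eval("correct"), Eval("warning"), Eval("error")]
print(review_list_sketch(evals, lambda evl: (False, True)))  # False
print(review_list_sketch(evals, lambda evl: (True, True)))   # True
```

Passing the reviewer in as a callable keeps the loop testable without any terminal interaction.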
- pdm_utils.pipelines.import_genome.review_object_list(object_list, object_type, attr_list, interactive=False, accept_warning=False)
Determine if evaluations are present and record results.
- Parameters
object_list (list) – List of pdm_utils objects containing evaluations.
object_type (str) – Name of the pdm_utils object.
attr_list (list) – List of attributes used to log data about the object instance.
interactive – same as for data_io().
accept_warning – same as for data_io().
- pdm_utils.pipelines.import_genome.run_checks(bndl, accession_set={}, phage_id_set={}, seq_set={}, host_genus_set={}, cluster_set={}, subcluster_set={}, file_ref='', ticket_ref='', retrieve_ref='', retain_ref='')
Run checks on the different types of data in a Bundle object.
- Parameters
bndl (Bundle) – A pdm_utils Bundle object containing bundled data.
accession_set (set) – Set of accessions to check against.
phage_id_set (set) – Set of PhageIDs to check against.
seq_set (set) – Set of nucleotide sequences to check against.
host_genus_set (set) – Set of host genera to check against.
cluster_set (set) – Set of Clusters to check against.
subcluster_set (set) – Set of Subclusters to check against.
file_ref – same as for prepare_bundle().
ticket_ref – same as for prepare_bundle().
retrieve_ref – same as for prepare_bundle().
retain_ref – same as for prepare_bundle().
- pdm_utils.pipelines.import_genome.set_cds_descriptions(gnm, tkt, interactive=False)
Set the primary CDS descriptions.
- Parameters
gnm (Genome) – A pdm_utils Genome object.
tkt (ImportTicket) – A pdm_utils ImportTicket object.
interactive – same as for data_io().