pdm_utils.pipelines package

Submodules

pdm_utils.pipelines.compare_db module

Pipeline to compare data between MySQL, PhagesDB, and GenBank databases.

pdm_utils.pipelines.compare_db.add_filters(filter_obj, filters)

Add filters from command line to filter object.

pdm_utils.pipelines.compare_db.check_for_errors(self)
pdm_utils.pipelines.compare_db.check_gbk_cds(cds_ftr)

Check for errors in GenBank CDS feature.

pdm_utils.pipelines.compare_db.check_gbk_gnms(gnm_dict)

Check for errors in GenBank genome and CDS features.

pdm_utils.pipelines.compare_db.check_locus_tag(self)
pdm_utils.pipelines.compare_db.check_matched_gnms(gnm_triads)

Compare all matched data.

pdm_utils.pipelines.compare_db.check_mysql_cds(cds_ftr)

Check for errors in MySQL CDS feature.

pdm_utils.pipelines.compare_db.check_mysql_gnms(gnm_dict)

Check for errors in MySQL matched genome and CDS features.

pdm_utils.pipelines.compare_db.check_pdb_gnms(gnm_dict)

Check for errors in PhagesDB genome.

pdm_utils.pipelines.compare_db.check_status_accession(self)
pdm_utils.pipelines.compare_db.compute_amino_acid_errors(self, protein_alphabet)
pdm_utils.pipelines.compare_db.compute_boundary_error(self)
pdm_utils.pipelines.compare_db.compute_cds_feature_errors(self)
pdm_utils.pipelines.compare_db.compute_description_error(self)
pdm_utils.pipelines.compare_db.compute_gbk_cds_feature_errors(self)
pdm_utils.pipelines.compare_db.compute_genes_with_errors_tally(self)
pdm_utils.pipelines.compare_db.compute_nucleotide_errors(self, dna_alphabet)
pdm_utils.pipelines.compare_db.compute_status_description_error(self)
pdm_utils.pipelines.compare_db.create_cds_summary_data(summary)

Create summary of all CDS results.

pdm_utils.pipelines.compare_db.create_cds_summary_fields()

Create CDS summary row ids.

pdm_utils.pipelines.compare_db.create_feature_data(mysql_gnm, mixed_ftr)

Create feature data to output.

pdm_utils.pipelines.compare_db.create_gene_headers()

Create list of column headers.

pdm_utils.pipelines.compare_db.create_genome_data(gnm_triad)

Create genome data to output.

pdm_utils.pipelines.compare_db.create_genome_headers()

Create list of column headers.

pdm_utils.pipelines.compare_db.create_genome_summary_data(summary)

Create summary of all genome results.

pdm_utils.pipelines.compare_db.create_genome_summary_fields()

Create genome summary row ids.

pdm_utils.pipelines.compare_db.filter_mysql_genomes(genome_list)

Only keep selected subset of genomes with no value duplications.

pdm_utils.pipelines.compare_db.get_all_features(gnm_triad)

Create list of all features.

pdm_utils.pipelines.compare_db.get_dbs(pdb, gbk)

Create set of databases to compare to MySQL.

pdm_utils.pipelines.compare_db.get_genbank_data(ncbi_cred_dict, accession_set, batch_size=200)

Retrieve genomes from GenBank.

pdm_utils.pipelines.compare_db.get_ids(alchemist, filters, target_table)

Get list of unique MySQL target table primary key values to evaluate.

pdm_utils.pipelines.compare_db.get_pdb_data(interactive)

Retrieve data from PhagesDB.

pdm_utils.pipelines.compare_db.get_primary_key(metadata, target_table)

Get the primary key to the target table.

pdm_utils.pipelines.compare_db.main(unparsed_args_list)

Run compare pipeline.

pdm_utils.pipelines.compare_db.match_all_genomes(mysql_gnms, pdb_gnms, gbk_gnms, pdb_name_duplicates, mysql_acc_duplicates)

Match MySQL, PhagesDB, and GenBank genomes.

pdm_utils.pipelines.compare_db.modify_cds_class(CdsClass)

Add new attributes and methods to Cds class.

pdm_utils.pipelines.compare_db.modify_genome_class(GenomeClass)

Add new attributes and methods to Genome class.

pdm_utils.pipelines.compare_db.output_all_data(output_path, summary)

Output all analysis results.

pdm_utils.pipelines.compare_db.output_to_file(data_list, folder, filename)

Output list data to file.

pdm_utils.pipelines.compare_db.parse_args(unparsed_args_list)

Verify the correct arguments are selected for comparing databases.

pdm_utils.pipelines.compare_db.prepare_unmatched_to_gbk_output(gnms)

Prepare list of MySQL data unmatched to GenBank, to be saved to file.

pdm_utils.pipelines.compare_db.prepare_unmatched_to_pdb_output(gnms)

Prepare list of MySQL data unmatched to PhagesDB, to be saved to file.

pdm_utils.pipelines.compare_db.process_gbk_data(working_path, ncbi_creds, accessions, interactive, save)

Retrieve and process GenBank data.

pdm_utils.pipelines.compare_db.process_mysql_data(working_path, engine, phage_ids, interactive, save)

Retrieve and process MySQL data.

pdm_utils.pipelines.compare_db.process_phagesdb_data(working_path, interactive, save)

Retrieve data from PhagesDB and process results.

pdm_utils.pipelines.compare_db.record_compare_settings(working_path, database, version, filters, valid_dbs)

Save user-selected settings.

pdm_utils.pipelines.compare_db.record_unmatched_gbk_data(gnms, working_path)

Save MySQL data that could not be matched to GenBank.

pdm_utils.pipelines.compare_db.record_unmatched_pdb_data(gnms, working_path)

Save MySQL data that could not be matched to PhagesDB.

pdm_utils.pipelines.compare_db.save_gbk_genome(gnm, record, output_path, interactive)

Save GenBank record to file.

pdm_utils.pipelines.compare_db.save_gnms_to_fasta(gnm_dict, main_path, new_dir, interactive)

Save genome data to fasta file.

pdm_utils.pipelines.compare_db.save_seqrecord(seqrecord, output_path, file_prefix, ext, seqrecord_ext, interactive)

Save record to file.

pdm_utils.pipelines.compare_db.selected_authors_lst(lst1)
pdm_utils.pipelines.compare_db.set_gbk_cds_attr(cds_ftr)

Set compare-specific Cds attributes not in pdm_utils Cds class.

pdm_utils.pipelines.compare_db.set_gbk_gnm_attr(gnm)

Set compare-specific attributes not in pdm_utils Genome class.

pdm_utils.pipelines.compare_db.set_locus_tag_typo(self)
pdm_utils.pipelines.compare_db.set_mysql_cds_attr(cds_ftr)

Set compare-specific MySQL Cds attributes not in pdm_utils Cds class.

pdm_utils.pipelines.compare_db.set_mysql_gnm_attr(gnm_dict)

Set compare-specific attributes not in pdm_utils Genome class.

pdm_utils.pipelines.compare_db.set_search_genome_id(self)
pdm_utils.pipelines.compare_db.set_start_end_strand_id(self)
pdm_utils.pipelines.compare_db.summarize_data(matched_genomes_list, working_path)

Create summary of data and save.

pdm_utils.pipelines.convert_db module

Pipeline to upgrade or downgrade the schema of a MySQL database.

pdm_utils.pipelines.convert_db.convert_schema(engine, actual, dir, steps, verbose=False)

Iterate through conversion steps and convert database schema.

pdm_utils.pipelines.convert_db.get_conversion_direction(actual, target)

Determine needed conversion direction and steps.
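
A minimal sketch of the direction logic implied by convert_schema() and its dir/steps arguments; the “upgrade”/“downgrade” strings and the return shape are assumptions rather than documented values:

   # Hypothetical sketch only, not the actual implementation.
   def conversion_direction_sketch(actual, target):
       if actual < target:
           return "upgrade", list(range(actual + 1, target + 1))
       if actual > target:
           return "downgrade", list(range(actual - 1, target - 1, -1))
       return None, []  # already at the target schema version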

pdm_utils.pipelines.convert_db.get_step_data(step_name)

Get dictionary of conversion step data.

pdm_utils.pipelines.convert_db.get_step_name(dir, step)

Generate the name of the conversion script file.

pdm_utils.pipelines.convert_db.main(unparsed_args_list)

Run main conversion pipeline.

pdm_utils.pipelines.convert_db.parse_args(unparsed_args_list)

Verify the correct arguments are selected for converting database.

pdm_utils.pipelines.convert_db.print_summary(summary)

Print summary of data that could be lost or inaccurate.

pdm_utils.pipelines.email_submitters module

pdm_utils.pipelines.export_db module

Pipeline for exporting database information into files.

pdm_utils.pipelines.export_db.append_database_version(genome_seqrecord, version_data)

Function that appends the database version to the SeqRecord comments.

Parameters
  • genome_seqrecord – Filled SeqRecord object.

  • version_data (dict) – Dictionary containing database version information.

pdm_utils.pipelines.export_db.decode_results(results, columns, verbose=False)

Function that decodes encoded results from SQLAlchemy generated data.

Parameters
  • results (list[dict]) – List of data dictionaries from a SQLAlchemy results proxy.

  • columns (list[Column]) – SQLAlchemy Column objects.

pdm_utils.pipelines.export_db.execute_csv_export(db_filter, export_path, folder_path, columns, csv_name, data_cache=None, sort=[], raw_bytes=False, verbose=False, dump=False)

Executes csv export of a MySQL database table with select columns.

Parameters
  • db_filter (Filter) – A connected and fully built Filter object.

  • export_path (Path) – Path to a dir for file creation.

  • folder_path (Path) – Path to a top-level dir.

  • table (str) – MySQL table name.

  • conditionals (list[BinaryExpression]) – MySQL WHERE clause-related SQLAlchemy objects.

  • sort (list[Column]) – A list of SQLAlchemy Columns to sort by.

  • values (list[str]) – List of values to filter database results.

  • verbose (bool) – A boolean value to toggle progress print statements.

  • dump (bool) – A boolean value to toggle dump in current working dir.

pdm_utils.pipelines.export_db.execute_export(alchemist, pipeline, folder_path=None, folder_name='20220119_export', values=None, verbose=False, dump=False, force=False, table='phage', filters='', groups=[], sort=[], include_columns=[], exclude_columns=[], sequence_columns=False, raw_bytes=False, concatenate=False, db_name=None, phams_out=False, threads=1)

Executes the entirety of the file export pipeline.

Parameters
  • alchemist (AlchemyHandler) – A connected and fully built AlchemyHandler object.

  • pipeline (str) – File type that dictates data processing.

  • folder_path (Path) – Path to a valid dir for new dir creation.

  • folder_name (str) – A name for the export folder.

  • force (bool) – A boolean to toggle aggressive building of directories.

  • values (list[str]) – List of values to filter database results.

  • verbose (bool) – A boolean value to toggle progress print statements.

  • dump (bool) – A boolean value to toggle dump in current working dir.

  • table (str) – MySQL table name.

  • filters (str) – A list of lists with filter values, grouped by ORs.

  • groups (list[str]) – A list of supported MySQL column names to group by.

  • sort (list[str]) – A list of supported MySQL column names to sort by.

  • include_columns (list[str]) – A csv export column selection parameter.

  • exclude_columns (list[str]) – A csv export column selection parameter.

  • sequence_columns (bool) – A boolean to toggle inclusion of sequence data.

  • concatenate – A boolean to toggle concatenation of SeqRecords.

  • threads (int) – Number of processes/threads to spawn during the pipeline.
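
A hedged usage sketch, assuming alchemist is an already-connected AlchemyHandler and that “csv” is an accepted pipeline value; the remaining argument values are illustrative, not documented defaults:

   from pathlib import Path

   from pdm_utils.pipelines.export_db import execute_export

   # `alchemist` is assumed to have been built and connected elsewhere.
   execute_export(alchemist, "csv",
                  folder_path=Path("./exports"),
                  folder_name="phage_export",
                  table="phage",
                  groups=["Cluster"],
                  verbose=True)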

pdm_utils.pipelines.export_db.execute_ffx_export(alchemist, export_path, folder_path, values, file_format, db_version, table, concatenate=False, data_cache=None, verbose=False, dump=False, threads=1, export_name=None)

Executes SeqRecord export of the compilation of data from a MySQL entry.

Parameters
  • alchemist (AlchemyHandler) – A connected and fully built AlchemyHandler object.

  • export_path (Path) – Path to a dir for file creation.

  • folder_path (Path) – Path to a top-level dir.

  • file_format (str) – Biopython supported file type.

  • db_version (dict) – Dictionary containing database version information.

  • table (str) – MySQL table name.

  • values (list[str]) – List of values to filter database results.

  • conditionals (list[BinaryExpression]) – MySQL WHERE clause-related SQLAlchemy objects.

  • sort (list[Column]) – A list of SQLAlchemy Columns to sort by.

  • concatenate – A boolean to toggle concatenation of SeqRecords.

  • verbose (bool) – A boolean value to toggle progress print statements.

pdm_utils.pipelines.export_db.execute_sql_export(alchemist, export_path, folder_path, db_version, db_name=None, dump=False, force=False, phams_out=False, threads=1, verbose=False)
pdm_utils.pipelines.export_db.filter_csv_columns(alchemist, table, include_columns=[], exclude_columns=[], sequence_columns=False)

Function that filters and constructs a list of Columns to select.

Parameters
  • alchemist (AlchemyHandler) – A connected and fully built AlchemyHandler object.

  • table (str) – MySQL table name.

  • include_columns (list[str]) – A list of supported MySQL column names.

  • exclude_columns (list[str]) – A list of supported MySQL column names.

  • sequence_columns (bool) – A boolean to toggle inclusion of sequence data.

Returns

A list of SQLAlchemy Column objects.

Return type

list[Column]

pdm_utils.pipelines.export_db.get_cds_seqrecords(alchemist, values, data_cache=None, nucleotide=False, verbose=False, file_format=None)
pdm_utils.pipelines.export_db.get_genome_seqrecords(alchemist, values, data_cache=None, verbose=False)
pdm_utils.pipelines.export_db.get_single_genome(alchemist, phageid, get_features=False, data_cache=None)
pdm_utils.pipelines.export_db.get_sort_columns(alchemist, sort_inputs)

Function that converts input for sorting to SQLAlchemy Columns.

Parameters
  • alchemist (AlchemyHandler) – A connected and fully built AlchemyHandler object.

  • sort_inputs (list[str]) – A list of supported MySQL column names.

Returns

A list of SQLAlchemy Column objects.

Return type

list[Column]

pdm_utils.pipelines.export_db.main(unparsed_args_list)

Uses parsed args to run the entirety of the file export pipeline.

Parameters

unparsed_args_list (list[str]) – Input a list of command line args.

pdm_utils.pipelines.export_db.parse_export(unparsed_args_list)

Parses export_db arguments and stores them with an argparse object.

Parameters

unparsed_args_list (list[str]) – Input a list of command line args.

Returns

ArgParse module parsed args.

pdm_utils.pipelines.export_db.parse_feature_data(alchemist, values=[], limit=8000)

Returns Cds objects containing data parsed from a MySQL database.

Parameters
  • alchemist (AlchemyHandler) – A connected and fully built AlchemyHandler object.

  • values (list[str]) – List of GeneIDs upon which the query can be conditioned.

pdm_utils.pipelines.find_domains module

pdm_utils.pipelines.find_domains.clear_domain_data(engine)

Delete all domain data stored in the database.

pdm_utils.pipelines.find_domains.construct_domain_stmt(data_dict)

Construct the SQL statement to insert data into the domain table.

pdm_utils.pipelines.find_domains.construct_gene_domain_stmt(data_dict, gene_id)

Construct the SQL statement to insert data into the gene_domain table.

pdm_utils.pipelines.find_domains.construct_gene_update_stmt(gene_id)

Construct the SQL statement to update data in the gene table.

pdm_utils.pipelines.find_domains.construct_sql_txn(gene_id, rps_data_list)

Map domain data back to gene_id and create SQL statements for one transaction.

rps_data_list is a list of dictionaries, where each dictionary reflects a significant rpsblast domain hit.

pdm_utils.pipelines.find_domains.construct_sql_txns(cds_trans_dict, rpsblast_results)

Construct the list of SQL transactions.

pdm_utils.pipelines.find_domains.create_cds_translation_dict(cdd_genes)

Create a dictionary of genes and translations.

Returns a dictionary, where: key = unique translation, value = set of GeneIDs with that translation.

pdm_utils.pipelines.find_domains.create_results_dict(search_results)

Create a dictionary of search results.

Input is a list of dictionaries, one dict per translation, where: keys = “Translation” and “Data”; key = “Translation” has value = translation, and key = “Data” has value = list of rpsblast results, where each result element is a dictionary containing domain and gene_domain data.

Returns a dictionary, where: key = unique translation, value = list of dictionaries, each dictionary a unique rpsblast result.
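
An illustrative sketch of the input and output shapes described above; the domain-hit keys and values are hypothetical:

   search_results = [
       {"Translation": "MSKLE...",
        "Data": [{"DomainID": "pfam00589"}]},   # hypothetical rpsblast hit
       {"Translation": "MTDLR...",
        "Data": []},                            # translation with no hits
   ]

   results_dict = create_results_dict(search_results)
   # results_dict maps each unique translation to its list of hit dictionaries:
   # {"MSKLE...": [{"DomainID": "pfam00589"}], "MTDLR...": []}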

pdm_utils.pipelines.find_domains.execute_statement(connection, statement)
pdm_utils.pipelines.find_domains.execute_transaction(connection, statement_list=[])
pdm_utils.pipelines.find_domains.get_rpsblast_command()

Determine rpsblast+ command based on operating system.

pdm_utils.pipelines.find_domains.get_rpsblast_path(command)

Determine rpsblast+ binary path.

pdm_utils.pipelines.find_domains.insert_domain_data(engine, results)

Attempt to insert domain data into the database.

pdm_utils.pipelines.find_domains.learn_cdd_name(cdd_dir)
pdm_utils.pipelines.find_domains.log_gene_ids(cdd_genes)

Record names of the genes processed for reference.

pdm_utils.pipelines.find_domains.main(argument_list)
Parameters

argument_list

Returns

pdm_utils.pipelines.find_domains.make_tempdir(tmp_dir)

Uses pdm_utils.functions.basic.expand_path to expand tmp_dir, then checks whether the directory exists; if it does not, uses os.makedirs to create it recursively.

Parameters

tmp_dir – location where I/O should take place
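
A rough, hypothetical equivalent of the behavior described above (a sketch, not the actual implementation):

   import os

   from pdm_utils.functions import basic

   def make_tempdir_sketch(tmp_dir):
       # Expand ~ and relative components to an absolute path.
       path = basic.expand_path(tmp_dir)
       # Create the directory tree only if it does not already exist.
       if not os.path.isdir(path):
           os.makedirs(path)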

pdm_utils.pipelines.find_domains.process_align(align)

Process alignment data.

Returns description, domain_id, and name.

pdm_utils.pipelines.find_domains.process_rps_output(filepath, evalue)

Process rpsblast output and return list of dictionaries.

pdm_utils.pipelines.find_domains.search_and_process(rpsblast, cdd_name, tmp_dir, evalue, translation_id, translation)

Uses rpsblast to search the indicated gene against the indicated CDD.

Parameters
  • rpsblast – path to the rpsblast binary

  • cdd_name – CDD database path/name

  • tmp_dir – path to the directory where I/O will take place

  • evalue – E-value cutoff for rpsblast

  • translation_id – unique identifier for the translation sequence

  • translation – protein sequence for the gene to query

Returns

results

pdm_utils.pipelines.find_domains.search_summary(rolled_back)

Print search results.

pdm_utils.pipelines.find_domains.search_translations(rpsblast, cdd_name, tmp_dir, evalue, threads, engine, unique_trans, cds_trans_dict)

Search for conserved domains in a list of unique translations.

pdm_utils.pipelines.find_domains.setup_argparser()

Builds an argparse.ArgumentParser for this script.

pdm_utils.pipelines.freeze_db module

Pipeline to freeze a database.

pdm_utils.pipelines.freeze_db.add_filters(filter_obj, filters)

Add filters from command line to filter object.

pdm_utils.pipelines.freeze_db.construct_count_query(table, primary_key, phage_id_set)

Construct SQL query to determine count.

pdm_utils.pipelines.freeze_db.construct_delete_stmt(table, primary_key, phage_id_set)

Construct SQL statement to delete data.

pdm_utils.pipelines.freeze_db.construct_set_string(phage_id_set)

Convert set of phage_ids to string formatted for MySQL.

e.g. set: {‘Trixie’, ‘L5’, ‘D29’} returns: “(‘Trixie’, ‘L5’, ‘D29’)”
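
A hedged sketch of how these helpers compose around the same arguments; the ‘phage’ table and ‘PhageID’ key are illustrative, and the exact SQL text produced is not documented here:

   phage_ids = {"Trixie", "L5", "D29"}

   id_string = construct_set_string(phage_ids)   # "('Trixie', 'L5', 'D29')" (element order may vary)
   count_query = construct_count_query("phage", "PhageID", phage_ids)
   delete_stmt = construct_delete_stmt("phage", "PhageID", phage_ids)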

pdm_utils.pipelines.freeze_db.get_prefix()

Allow user to select appropriate prefix for the new database.

pdm_utils.pipelines.freeze_db.main(unparsed_args_list)

Run main freeze database pipeline.

pdm_utils.pipelines.freeze_db.parse_args(unparsed_args_list)

Verify the correct arguments are selected.

pdm_utils.pipelines.get_data module

Pipeline to gather new data to be imported into a MySQL database.

pdm_utils.pipelines.get_data.check_record_date(record_list, accession_dict)

Check whether the GenBank record is new.

pdm_utils.pipelines.get_data.compare_data(gnm_pair)

Compare data and create update tickets.

pdm_utils.pipelines.get_data.compute_genbank_tallies(results)

Tally results from GenBank retrieval.

pdm_utils.pipelines.get_data.convert_tickets_to_dict(list_of_tickets)

Convert list of tickets to list of dictionaries.

pdm_utils.pipelines.get_data.create_accession_sets(genome_dict)

Generate set of unique and non-unique accessions.

Input is a dictionary of pdm_utils genome objects.
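
A minimal sketch of one way to split unique from non-unique accessions, assuming each genome object exposes an accession attribute; the real function’s return shape is not documented here:

   from collections import Counter

   def accession_sets_sketch(genome_dict):
       counts = Counter(g.accession for g in genome_dict.values() if g.accession)
       unique = {acc for acc, n in counts.items() if n == 1}
       duplicated = {acc for acc, n in counts.items() if n > 1}
       return unique, duplicated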

pdm_utils.pipelines.get_data.create_draft_ticket(name)

Create ImportTicket for draft genome.

pdm_utils.pipelines.get_data.create_genbank_ticket(gnm)

Create ImportTicket for GenBank record.

pdm_utils.pipelines.get_data.create_phagesdb_ticket(phage_id)

Create ImportTicket for PhagesDB genome.

pdm_utils.pipelines.get_data.create_results_dict(gnm, genbank_date, result)

Create a dictionary of data summarizing NCBI retrieval status.

pdm_utils.pipelines.get_data.create_ticket_table(tickets, output_folder)

Save tickets associated with files retrieved from GenBank.

pdm_utils.pipelines.get_data.create_update_ticket(field, value, key_value)

Create update ticket.

pdm_utils.pipelines.get_data.get_accessions_to_retrieve(summary_records, accession_dict)

Review GenBank summary to determine which records are new.

pdm_utils.pipelines.get_data.get_draft_data(output_path, phage_id_set)

Run sub-pipeline to retrieve auto-annotated ‘draft’ genomes.

pdm_utils.pipelines.get_data.get_final_data(output_folder, matched_genomes)

Run sub-pipeline to retrieve ‘final’ genomes from PhagesDB.

pdm_utils.pipelines.get_data.get_genbank_data(output_folder, genome_dict, ncbi_cred_dict={}, genbank_results=False, force=False)

Run sub-pipeline to retrieve genomes from GenBank.

pdm_utils.pipelines.get_data.get_matched_drafts(matched_genomes)

Generate a list of matched ‘draft’ genomes.

pdm_utils.pipelines.get_data.get_update_data(output_folder, matched_genomes)

Run sub-pipeline to retrieve field updates from PhagesDB.

pdm_utils.pipelines.get_data.main(unparsed_args_list)

Run main retrieve_updates pipeline.

pdm_utils.pipelines.get_data.match_genomes(dict1, dict2)

Match MySQL database genome data to PhagesDB genome data.

Both dictionaries: key = PhageID, value = pdm_utils genome object.
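
Because both inputs are keyed by PhageID, matching reduces to key intersection; a minimal sketch (that the real function builds pdm_utils GenomePair objects from these matches is an assumption, not confirmed here):

   shared_ids = dict1.keys() & dict2.keys()    # PhageIDs present in both sources
   mysql_only = dict1.keys() - dict2.keys()    # MySQL genomes with no PhagesDB match
   matched = {pid: (dict1[pid], dict2[pid]) for pid in shared_ids}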

pdm_utils.pipelines.get_data.output_genbank_summary(output_folder, results)

Save summary of GenBank retrieval results to file.

pdm_utils.pipelines.get_data.parse_args(unparsed_args_list)

Verify the correct arguments are selected for getting updates.

pdm_utils.pipelines.get_data.print_genbank_tallies(tallies)

Print results of GenBank retrieval.

pdm_utils.pipelines.get_data.print_match_results(dict)

Print results of genome matching.

pdm_utils.pipelines.get_data.process_failed_retrieval(accession_list, accession_dict)

Create list of dictionaries for records that could not be retrieved.

pdm_utils.pipelines.get_data.retrieve_drafts(output_folder, phage_list)

Retrieve auto-annotated ‘draft’ genomes from PECAAN.

pdm_utils.pipelines.get_data.retrieve_records(accession_dict, ncbi_folder, batch_size=200)

Retrieve GenBank records.

pdm_utils.pipelines.get_data.save_and_tickets(record_list, accession_dict, output_folder)

Save flat files retrieved from GenBank and create import tickets.

pdm_utils.pipelines.get_data.save_genbank_file(seqrecord, accession, name, output_folder)

Save retrieved record to file.

pdm_utils.pipelines.get_data.save_pecaan_file(response, name, output_folder)

Save data retrieved from PECAAN.

pdm_utils.pipelines.get_data.save_phagesdb_file(data, gnm, output_folder)

Save file retrieved from PhagesDB.

pdm_utils.pipelines.get_data.set_phagesdb_gnm_date(gnm)

Set the date of a PhagesDB genome object.

pdm_utils.pipelines.get_data.set_phagesdb_gnm_file(gnm)

Set the filename of a PhagesDB genome object.

pdm_utils.pipelines.get_data.sort_by_accession(genome_dict, force=False)

Sort genome objects based on their accession status.

Only retain data if genome is set to be automatically updated, there is a valid accession, and the accession is unique.

pdm_utils.pipelines.get_db module

Pipeline to install a pdm_utils MySQL database.

pdm_utils.pipelines.get_db.execute_get_file_db(alchemist, database, filename, config_file=None, schema_version=None, verbose=False)
pdm_utils.pipelines.get_db.execute_get_new_db(alchemist, database, schema_version, config_file=None, verbose=False)
pdm_utils.pipelines.get_db.execute_get_server_db(alchemist, database, url, folder_path=None, folder_name='20220119_get_db', db_name=None, config_file=None, verbose=False, subdirectory=None, download_only=False, get_fastas=False, get_alns=False, force_pull=False, get_version=False, schema_version=None)
pdm_utils.pipelines.get_db.install_db(alchemist, database, db_filepath=None, config_file=None, schema_version=None, verbose=False, pipeline=False)

Install the database. If the database already exists, it is first removed.

Parameters
  • database (str) – Name of the database to be installed

  • db_filepath (Path) – Directory for installation

pdm_utils.pipelines.get_db.main(unparsed_args_list)

Run the get_db pipeline.

The database can be retrieved from three places: the server, which requires downloading to a new folder; a local file, which requires no download and no new folder; or the empty schema stored within pdm_utils, which requires no download, new folder, or local file.

Parameters

unparsed_args_list (list) – list of arguments to run the pipeline unparsed

pdm_utils.pipelines.get_db.parse_args(unparsed_args_list)

Verify the correct arguments are selected for getting a new database.

Parameters

unparsed_args_list (list) – arguments in sys.argv format

Returns

A parsed list of arguments

Return type

argparse.Namespace

pdm_utils.pipelines.get_db.prepare_download(local_folder, url_folder, db_name, extension, verbose=False)

Construct the filepath, check whether it already exists, then download.

Parameters
  • local_folder (Path) – Working directory where the database is downloaded

  • url_folder (str) – Base URL where db_files are located

  • db_name (str) – Name of the database to be downloaded

  • extension (str) – File extension for the database

Returns

Path to the destination directory and the status of the download

Return type

Path, bool

pdm_utils.pipelines.get_gb_records module

Pipeline to retrieve GenBank records using accessions stored in the MySQL database.

pdm_utils.pipelines.get_gb_records.copy_gb_data(ncbi_handle, acc_id_dict, records_path, file_type, verbose=False)

Save retrieved records to file.

pdm_utils.pipelines.get_gb_records.execute_get_gb_records(alchemist, file_type, folder_path=None, folder_name='20220119_gb_records', config=None, values=None, verbose=False, force=False, filters='', groups=[])

Executes the entirety of the get_gb_records pipeline.

Parameters
  • alchemist (AlchemyHandler) – A connected and fully built AlchemyHandler object.

  • folder_path (Path) – Path to a valid dir for new dir creation.

  • folder_name (str) – A name for the export folder.

  • file_type (str) – File type to be exported.

  • config (ConfigParser) – ConfigParser object containing NCBI credentials.

  • force (bool) – A boolean to toggle aggressive building of directories.

  • values (list[str]) – List of values to filter database results.

  • verbose (bool) – A boolean value to toggle progress print statements.

  • filters – A list of lists with filter values, grouped by ORs.

  • groups (list[str]) – A list of supported MySQL column names to group by.

pdm_utils.pipelines.get_gb_records.main(unparsed_args_list)

Run main get_gb_records pipeline.

pdm_utils.pipelines.get_gb_records.parse_args(unparsed_args_list)

Parses export_db arguments and stores them with an argparse object.

Parameters

unparsed_args_list (list[str]) – Input a list of command line args.

Returns

ArgParse module parsed args.

pdm_utils.pipelines.import_genome module

Primary pipeline to process and evaluate data to be imported into the MySQL database.

pdm_utils.pipelines.import_genome.check_bundle(bndl, ticket_ref='', file_ref='', retrieve_ref='', retain_ref='')

Check a Bundle for errors.

Evaluate whether all genomes have been successfully grouped, and whether all genomes have been paired, as expected. Based on the ticket type, there are expected to be certain types of genomes and pairs of genomes in the bundle.

Parameters
  • bndl – same as for run_checks().

  • ticket_ref – same as for prepare_bundle().

  • file_ref – same as for prepare_bundle().

  • retrieve_ref – same as for prepare_bundle().

  • retain_ref – same as for prepare_bundle().

pdm_utils.pipelines.import_genome.check_cds(cds_ftr, eval_flags, description_field='product')

Check a Cds object for errors.

Parameters
  • cds_ftr (Cds) – A pdm_utils Cds object.

  • eval_flags (dict) – Dictionary of boolean evaluation flags.

  • description_field (str) – Description field to check against.

pdm_utils.pipelines.import_genome.check_genome(gnm, tkt_type, eval_flags, phage_id_set={}, seq_set={}, host_genus_set={}, cluster_set={}, subcluster_set={}, accession_set={})

Check a Genome object parsed from file for errors.

Parameters
  • gnm (Genome) – A pdm_utils Genome object.

  • tkt_type (str) – ImportTicket type.

  • eval_flags (dict) – Dictionary of boolean evaluation flags.

  • phage_id_set (set) – Set of PhageIDs to check against.

  • seq_set (set) – Set of genome sequences to check against.

  • host_genus_set (set) – Set of host genera to check against.

  • cluster_set (set) – Set of clusters to check against.

  • subcluster_set (set) – Set of subclusters to check against.

  • accession_set (set) – Set of accessions to check against.

pdm_utils.pipelines.import_genome.check_retain_genome(gnm, tkt_type, eval_flags)

Check a Genome object currently in database for errors.

Parameters
  • gnm (Genome) – A pdm_utils Genome object.

  • tkt_type (str) – ImportTicket type

  • eval_flags (dict) – Dictionary of boolean evaluation flags.

pdm_utils.pipelines.import_genome.check_source(src_ftr, eval_flags, host_genus='')

Check a Source object for errors.

Parameters
  • src_ftr (Source) – A pdm_utils Source object.

  • eval_flags (dict) – Dictionary of boolean evaluation flags.

  • host_genus (str) – Host genus to check against.

pdm_utils.pipelines.import_genome.check_ticket(tkt, type_set={}, description_field_set={}, eval_mode_set={}, id_dupe_set={}, phage_id_dupe_set={}, retain_set={}, retrieve_set={}, add_set={}, parse_set={})

Evaluate a ticket to confirm it is structured appropriately.

The assumptions for how each field is populated varies depending on the type of ticket.

Parameters
  • tkt – same as for set_cds_descriptions().

  • type_set (set) – Set of ImportTicket types to check against.

  • description_field_set (set) – Set of description fields to check against.

  • eval_mode_set (set) – Set of evaluation modes to check against.

  • id_dupe_set (set) – Set of duplicated ImportTicket ids to check against.

  • phage_id_dupe_set (set) – Set of duplicated ImportTicket PhageIDs to check against.

  • retain_set (set) – Set of retain values to check against.

  • retrieve_set (set) – Set of retrieve values to check against.

  • add_set (set) – Set of add values to check against.

  • parse_set (set) – Set of parse values to check against.

pdm_utils.pipelines.import_genome.check_tmrna(tmrna_ftr, eval_flags)

Check a Tmrna object for errors.

Parameters
  • tmrna_ftr (Tmrna) – A pdm_utils Tmrna object.

  • eval_flags (dict) – Dictionary of boolean evaluation flags.

pdm_utils.pipelines.import_genome.check_trna(trna_ftr, eval_flags)

Check a Trna object for errors.

Parameters
  • trna_ftr (Trna) – A pdm_utils Trna object.

  • eval_flags (dict) – Dictionary of boolean evaluation flags.

pdm_utils.pipelines.import_genome.compare_genomes(genome_pair, eval_flags)

Compare two genomes to identify discrepancies.

Parameters
  • genome_pair (GenomePair) – A pdm_utils GenomePair object.

  • eval_flags (dict) – Dictionary of boolean evaluation flags.

pdm_utils.pipelines.import_genome.data_io(engine=None, genome_folder=PosixPath('.'), import_table_file=PosixPath('.'), genome_id_field='', host_genus_field='', prod_run=False, description_field='', eval_mode='', output_folder=PosixPath('.'), interactive=False, accept_warning=False)

Set up output directories, log files, etc. for import.

Parameters
  • engine (Engine) – SQLAlchemy Engine object able to connect to a MySQL database.

  • genome_folder (Path) – Path to the folder of flat files.

  • import_table_file (Path) – Path to the import table file.

  • genome_id_field (str) – The SeqRecord attribute that stores the genome identifier/name.

  • host_genus_field (str) – The SeqRecord attribute that stores the host genus identifier/name.

  • prod_run (bool) – Indicates whether MySQL statements will be executed.

  • description_field (str) – The SeqFeature attribute that stores the feature’s description.

  • eval_mode (str) – Name of the evaluation mode used to evaluate genomes.

  • output_folder (Path) – Path to the folder to store results.

  • interactive (bool) – Indicates whether the user is able to interact with genome evaluations at run time.

  • accept_warning (bool) – Toggles whether the import pipeline will accept warnings without interactivity.

pdm_utils.pipelines.import_genome.get_logfile_path(bndl, paths_dict=None, filepath=None, file_ref=None)

Choose the path to output the file-specific log.

Parameters
  • bndl – same as for run_checks().

  • paths_dict (dict) – Dictionary indicating paths to success and fail folders.

  • filepath (Path) – Path to flat file.

  • file_ref – same as for prepare_bundle().

Returns

Path to log file to store flat-file-specific evaluations. If paths_dict is set to None, then None is returned instead of a path.

Return type

Path

pdm_utils.pipelines.import_genome.get_mysql_reference_sets(engine)

Get multiple sets of data from the MySQL database for reference.

Parameters

engine – same as for data_io().

Returns

Dictionary of unique PhageIDs, clusters, subclusters, host genera, accessions, and sequences stored in the MySQL database.

Return type

dict
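
A hedged sketch of feeding these reference sets into run_checks(); the dictionary key names below are assumptions, since only the set contents are documented:

   ref_sets = get_mysql_reference_sets(engine)

   run_checks(bndl,
              accession_set=ref_sets["accession_set"],   # hypothetical key
              phage_id_set=ref_sets["phage_id_set"],     # hypothetical key
              seq_set=ref_sets["seq_set"])               # hypothetical key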

pdm_utils.pipelines.import_genome.get_phagesdb_reference_sets()

Get multiple sets of data from PhagesDB for reference.

Returns

Dictionary of unique clusters, subclusters, and host genera stored on PhagesDB.

Return type

dict

pdm_utils.pipelines.import_genome.get_result_string(object, attr_list)

Construct string of values from several object attributes.

Parameters
  • object (misc) – A object from which to retrieve values.

  • attr_list (list) – List of strings indicating attributes to retrieve from the object.

Returns

A concatenated string representing values from all attributes.

Return type

str

pdm_utils.pipelines.import_genome.import_into_db(bndl, engine=None, gnm_key='', prod_run=False)

Import data into the MySQL database.

Parameters
  • bndl – same as for run_checks().

  • engine – same as for data_io().

  • gnm_key (str) – Identifier for the Genome object in the Bundle’s genome dictionary.

  • prod_run – same as for data_io().

pdm_utils.pipelines.import_genome.log_and_print(msg, terminal=False)

Print message to terminal in addition to logger if needed.

Parameters
  • msg (str) – Message to print.

  • terminal (bool) – Indicates if message should be printed to terminal.

pdm_utils.pipelines.import_genome.log_evaluations(dict_of_dict_of_lists, logfile_path=None)

Export evaluations to log.

Parameters
  • dict_of_dict_of_lists (dict) – Dictionary of evaluation dictionaries. Key1 = Bundle ID. Value1 = dictionary for each object in the Bundle. Key2 = object type (‘bundle’, ‘ticket’, etc.) Value2 = List of evaluation objects.

  • logfile_path (Path) – Path to the log file.
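
An illustrative sketch of the expected nesting; the Bundle IDs and object types are placeholder values, and the real list elements are pdm_utils Evaluation objects rather than strings:

   evaluations = {
       1: {"bundle": ["<Evaluation>", "<Evaluation>"],
           "ticket": ["<Evaluation>"]},
       2: {"genome": ["<Evaluation>"]},
   }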

pdm_utils.pipelines.import_genome.main(unparsed_args_list)

Runs the complete import pipeline.

This is the only function of the pipeline that requires user input. All other functions can be implemented from other scripts.

Parameters

unparsed_args_list (list) – List of strings representing command line arguments.
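
Because main() takes the raw argv-style list, one hedged way to drive the pipeline from another script is to pass sys.argv straight through (the specific command-line flags are not documented here and are therefore omitted):

   import sys

   from pdm_utils.pipelines import import_genome

   import_genome.main(sys.argv)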

pdm_utils.pipelines.import_genome.parse_args(unparsed_args_list)

Verify the correct arguments are selected for import new genomes.

Parameters

unparsed_args_list (list) – List of strings representing command line arguments.

Returns

ArgumentParser Namespace object containing the parsed args.

Return type

Namespace

pdm_utils.pipelines.import_genome.prepare_bundle(filepath=PosixPath('.'), ticket_dict={}, engine=None, genome_id_field='', host_genus_field='', id=None, file_ref='', ticket_ref='', retrieve_ref='', retain_ref='', id_conversion_dict={}, interactive=False)

Gather all genomic data needed to evaluate the flat file.

Parameters
  • filepath (Path) – Name of a GenBank-formatted flat file.

  • ticket_dict (dict) – A dictionary of pdm_utils ImportTicket objects.

  • engine – same as for data_io().

  • genome_id_field – same as for data_io().

  • host_genus_field – same as for data_io().

  • id (int) – Identifier to be assigned to the Bundle object.

  • file_ref (str) – Identifier for Genome objects derived from flat files.

  • ticket_ref (str) – Identifier for Genome objects derived from ImportTickets.

  • retrieve_ref (str) – Identifier for Genome objects derived from PhagesDB.

  • retain_ref (str) – Identifier for Genome objects derived from MySQL.

  • id_conversion_dict (dict) – Dictionary of PhageID conversions.

  • interactive – same as for data_io().

Returns

A pdm_utils Bundle object containing all data required to evaluate a flat file.

Return type

Bundle

pdm_utils.pipelines.import_genome.prepare_tickets(import_table_file=PosixPath('.'), eval_data_dict=None, description_field='', table_structure_dict={})

Prepare dictionary of pdm_utils ImportTickets.

Parameters
  • import_table_file – same as for data_io().

  • description_field – same as for data_io().

  • eval_data_dict (dict) – Evaluation data dictionary Key1 = “eval_mode” Value1 = Name of the eval_mode Key2 = “eval_flag_dict” Value2 = Dictionary of evaluation flags.

  • table_structure_dict (dict) – Dictionary describing structure of the import table.

Returns

Dictionary of pdm_utils ImportTicket objects. If a problem was encountered parsing the import table, None is returned.

Return type

dict
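
An illustrative shape for eval_data_dict based on the key descriptions above; the mode name and flag names are hypothetical:

   eval_data_dict = {
       "eval_mode": "final",                  # hypothetical mode name
       "eval_flag_dict": {"check_seq": True,  # hypothetical evaluation flags
                          "check_id_typo": False},
   }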

pdm_utils.pipelines.import_genome.process_files_and_tickets(ticket_dict, files_in_folder, engine=None, prod_run=False, genome_id_field='', host_genus_field='', interactive=False, log_folder_paths_dict=None, accept_warning=False)

Process GenBank-formatted flat files and import tickets.

Parameters
  • ticket_dict (dict) – A dictionary WHERE key (str) = the ticket’s phage_id and value (Ticket) = the ticket.

  • files_in_folder (list) – A list of filepaths to be parsed.

  • engine – same as for data_io().

  • prod_run – same as for data_io().

  • genome_id_field – same as for data_io().

  • host_genus_field – same as for data_io().

  • interactive – same as for data_io().

  • accept_warning – same as for data_io().

  • log_folder_paths_dict (dict) – Dictionary indicating paths to success and fail folders.

Returns

tuple of five objects WHERE [0] success_ticket_list (list) is a list of successful ImportTickets. [1] failed_ticket_list (list) is a list of failed ImportTickets. [2] success_filepath_list (list) is a list of successfully parsed flat files. [3] failed_filepath_list (list) is a list of unsuccessfully parsed flat files. [4] evaluation_dict (dict): dictionary from each Bundle, containing dictionaries for each bundled object, containing lists of evaluation objects.

Return type

tuple
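
Unpacking the documented five-element return value, assuming ticket_dict, files_in_folder, and engine were prepared as described above (the variable names here are illustrative):

   (success_tickets, failed_tickets,
    success_files, failed_files,
    evaluation_dict) = process_files_and_tickets(ticket_dict, files_in_folder,
                                                 engine=engine, prod_run=False)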

pdm_utils.pipelines.import_genome.review_bundled_objects(bndl, interactive=False, accept_warning=False)

Review all evaluations of all bundled objects.

Iterate through all objects stored in the bundle. If there are warnings, review whether status should be changed.

Parameters
  • bndl – same as for run_checks().

  • interactive – same as for data_io().

  • accept_warning – same as for data_io().

pdm_utils.pipelines.import_genome.review_cds_descriptions(feature_list, description_field)

Iterate through all CDS features and review descriptions.

Parameters
  • feature_list (list) – A list of pdm_utils Cds objects.

  • description_field – same as for data_io().

Returns

Name of the primary description_field after review.

Return type

str

pdm_utils.pipelines.import_genome.review_evaluation(evl, interactive=False, accept_warning=False)

Review an evaluation object.

Parameters
  • evl (Evaluation) – A pdm_utils Evaluation object.

  • interactive – same as for data_io().

  • accept_warning – same as for data_io().

Returns

tuple (exit, correct) WHERE exit (bool) indicates whether the user selected to exit the review process and correct (bool) indicates whether the evaluation status is accurate.

Return type

tuple

pdm_utils.pipelines.import_genome.review_evaluation_list(evaluation_list, interactive=False, accept_warning=False)

Iterate through all evaluations and review ‘warning’ results.

Parameters
  • evaluation_list (list) – List of pdm_utils Evaluation objects.

  • interactive – same as for data_io().

  • accept_warning – same as for data_io().

Returns

Indicates whether user selected to exit the review process.

Return type

bool

pdm_utils.pipelines.import_genome.review_object_list(object_list, object_type, attr_list, interactive=False, accept_warning=False)

Determine if evaluations are present and record results.

Parameters
  • object_list (list) – List of pdm_utils objects containing evaluations.

  • object_type (str) – Name of the pdm_utils object.

  • attr_list (list) – List of attributes used to log data about the object instance.

  • interactive – same as for data_io().

  • accept_warning – same as for data_io().

pdm_utils.pipelines.import_genome.run_checks(bndl, accession_set={}, phage_id_set={}, seq_set={}, host_genus_set={}, cluster_set={}, subcluster_set={}, file_ref='', ticket_ref='', retrieve_ref='', retain_ref='')

Run checks on the different types of data in a Bundle object.

Parameters
  • bndl (Bundle) – A pdm_utils Bundle object containing bundled data.

  • accession_set (set) – Set of accessions to check against.

  • phage_id_set (set) – Set of PhageIDs to check against.

  • seq_set (set) – Set of nucleotide sequences to check against.

  • host_genus_set (set) – Set of host genera to check against.

  • cluster_set (set) – Set of Clusters to check against.

  • subcluster_set (set) – Set of Subclusters to check against.

  • file_ref – same as for prepare_bundle().

  • ticket_ref – same as for prepare_bundle().

  • retrieve_ref – same as for prepare_bundle().

  • retain_ref – same as for prepare_bundle().

pdm_utils.pipelines.import_genome.set_cds_descriptions(gnm, tkt, interactive=False)

Set the primary CDS descriptions.

Parameters
  • gnm (Genome) – A pdm_utils Genome object.

  • tkt (ImportTicket) – A pdm_utils ImportTicket object.

  • interactive – same as for data_io().

pdm_utils.pipelines.pham_finder module

Pipeline for mapping differences in PhamIDs between databases.

pdm_utils.pipelines.pham_finder.execute_pham_finder(alchemist, folder_path, folder_name, adatabase, bdatabase, values=None, filters='', groups=[], sort=[], show_per=False, use_locus=False, verbose=False)

Executes the entirety of the pham_finder pipeline.

Parameters
  • alchemist (AlchemyHandler) – A connected and fully built AlchemyHandler object.

  • folder_path (Path) – Path to a valid dir for new dir creation.

  • folder_name (str) – A name for the export folder.

  • adatabase (str) – Name of reference database to source phams from.

  • bdatabase (str) – Name of database to find corresponding phams for.

  • values (list[str]) – List of values to filter database results.

  • verbose (bool) – A boolean value to toggle progress print statements.

  • table (str) – MySQL table name.

  • filters (str) – A list of lists with filter values, grouped by ORs.

  • groups (list[str]) – A list of supported MySQL column names to group by.

  • sort (list[str]) – A list of supported MySQL column names to sort by.

  • show_per (bool) – Enables display of gene coverage for the corresponding phams.

  • use_locus (bool) – Toggles conversion between phams using LocusTag instead

pdm_utils.pipelines.pham_finder.find_phams(a_filter, b_filter, show_per=False, use_locus=False)

Helper function that finds corresponding phams via GeneID intermediates.

Parameters
  • a_filter (Filter) – Fully built Filter connected to the reference database.

  • b_filter (Filter) – Fully built Filter connected to a database.

  • show_per (bool) – Enables display of gene coverage for the corresponding phams.

Returns

Returns a dictionary mapping original phams to corresponding phams.

Return type

dict{int:str}

pdm_utils.pipelines.pham_finder.main(unparsed_args_list)

Uses parsed args to run the entirety of the pham_finder pipeline.

Parameters

unparsed_args_list (list[str]) – Input a list of command line args.

pdm_utils.pipelines.pham_finder.parse_pham_finder(unparsed_args_list)

Parses pham_finder arguments and stores them with an argparse object.

Parameters

unparsed_args_list (list[str]) – Input a list of command line args.

Returns

ArgParse module parsed args.

pdm_utils.pipelines.phamerate module

Program to group related gene products into phamilies using either MMseqs2 for both similarity search and clustering, or blastp for similarity search and mcl for clustering.

pdm_utils.pipelines.phamerate.main(argument_list)
pdm_utils.pipelines.phamerate.refresh_tempdir(tmpdir)

Recursively deletes tmpdir if it exists, otherwise creates it.

Parameters

tmpdir – directory to refresh

pdm_utils.pipelines.phamerate.setup_argparser()

Builds an argparse.ArgumentParser for this script.

pdm_utils.pipelines.push_db module

Pipeline to push files to a server using SFTP.

pdm_utils.pipelines.push_db.get_files(directory, file, ignore)

Get the list of file(s) that need to be uploaded.

Parameters
  • directory – (optional) directory containing files for upload

  • file (pathlib.Path) – (optional) file to upload

  • ignore (set) – file(s) to ignore during upload process

Type

directory: pathlib.Path

Returns

file_list

pdm_utils.pipelines.push_db.main(unparsed_args)

Driver function for the push pipeline.

Parameters

unparsed_args (list) – the command-line arguments given to this pipeline’s caller (likely pdm_utils.__main__)

pdm_utils.pipelines.push_db.parse_args(unparsed_args)

Verify the correct arguments are selected for uploading to the server.

pdm_utils.pipelines.push_db.upload(sftp_client, destination, files)

Try to upload the file(s).

Parameters
  • sftp_client (paramiko.SFTPClient) – an open SFTPClient

  • destination (pathlib.Path) – remote file directory to upload to

  • files (list of pathlib.Path) – the file(s) to upload

Returns

successes, failures

pdm_utils.pipelines.review module

pdm_utils.pipelines.revise module

Pipeline to automate product annotation resubmissions to GenBank.

pdm_utils.pipelines.revise.build_id_record_map(alchemist, phageids)
pdm_utils.pipelines.revise.build_revise_log_file(working_dir)
pdm_utils.pipelines.revise.clean_feature(feature)

Revise helper function to format a SeqFeature for Feature Table export.

Parameters

feature (SeqFeature) – Biopython SeqFeature object populated with MySQL data.

pdm_utils.pipelines.revise.create_feature_map(record)

Revise helper function to map the features associated with each locus tag.

Parameters

record (SeqRecord) – SeqRecord object to map gene features for.

Returns

Returns a dictionary mapping locus_tags to features.

Return type

dict

pdm_utils.pipelines.revise.curate_record_product_discrepancies(target_record, template_record, verbose=False)
pdm_utils.pipelines.revise.execute_local_revise(alchemist, revisions_file_path, folder_path=None, folder_name='20220119_revise', config=None, input_type='function_report', output_type='p_curation', production=False, filters='', groups=[], force=False, verbose=False)

Executes the entirety of the genbank local revise pipeline.

Parameters
  • alchemist (AlchemyHandler) – A connected and fully built AlchemyHandler object.

  • revisions_file_path (Path) – Path to the file containing pham/notes revision data.

  • folder_path (Path) – Path to a valid dir for new dir creation.

  • folder_name (str) – A name for the export folder.

  • input_type (str) – Specifies the file format of the input file.

  • output_type (str) – Specifies the file format of the output file.

  • production (bool) – Toggles additional filters for production-level revision.

  • verbose (bool) – A boolean value to toggle progress print statements.

  • force (bool) – A boolean to toggle aggressive building of directories.

pdm_utils.pipelines.revise.execute_remote_revise(alchemist, folder_path=None, folder_name='20220119_revise', config=None, output_type='p_curation', values=None, filters='', groups=[], verbose=False, force=False)
pdm_utils.pipelines.revise.find_product_discrepancies(source_id_record_map, target_records, verbose=False)
pdm_utils.pipelines.revise.format_curation_data(row_dict)

Function to format revise dictionary keys to gb curation format.

Parameters
  • row_dict (dict) – Data dictionary for a revise file.

  • product (str) – Gene product to append to the revise data dictionary.

pdm_utils.pipelines.revise.format_update_ticket_data(row_dict)

Function to format revise dictionary keys to update ticket format.

Parameters
  • row_dict (dict) – Data dictionary for a revise file.

  • product (str) – Gene product to append to the revise data dictionary.

pdm_utils.pipelines.revise.get_tbl_records(acc_id_dict, ncbi_cred_dict={})
pdm_utils.pipelines.revise.main(unparsed_args_list)

Uses parsed args to run the entirety of the revise pipeline.

Parameters

unparsed_args_list (list[str]) – Input a list of command line args.

pdm_utils.pipelines.revise.parse_revise(unparsed_args_list)

Parses revise arguments and stores them with an argparse object.

Parameters

unparsed_args_list (list[str]) – Input a list of command line args.

Returns

ArgParse module parsed args.

pdm_utils.pipelines.revise.revise_cds_feature(target_cds, template_cds, locus_tag, verbose=False)
pdm_utils.pipelines.revise.revise_gene_feature(target_gene, template_gene, locus_tag, verbose=False)
pdm_utils.pipelines.revise.revise_seqrecord(target_record, template_record, verbose=False)

Function to edit a target record based on data from a template record.

Parameters
  • target_record (SeqRecord) – SeqRecord object to be changed based on the template.

  • template_record (SeqRecord) – SeqRecord object to be used as a source of data.

pdm_utils.pipelines.revise.revise_seqrecords(source_id_record_map, target_records, verbose=False)
pdm_utils.pipelines.revise.revise_trna_feature(target_trna, template_trna, locus_tag, verbose=False)
pdm_utils.pipelines.revise.use_csv_data(db_filter, data_dicts, columns, conditionals, verbose=False)

Reads in gene table csv data and pairs it with existing data.

Parameters
  • db_filter (Filter) – A connected and fully built Filter object.

  • data_dicts (list[dict]) – List of data dictionaries from a FunctionReport file.

  • columns (list[Column]) – List of SQLAlchemy Columns to retrieve data for.

  • conditionals (List[BinaryExpression]) – List of SQLAlchemy BinaryExpressions to filter with.

  • verbose (bool) – A boolean value to toggle progress print statements.

pdm_utils.pipelines.revise.use_function_report_data(db_filter, data_dicts, columns, conditionals, verbose=False)

Reads in FunctionReport data and pairs it with existing data.

Parameters
  • db_filter (Filter) – A connected and fully built Filter object.

  • data_dicts (list[dict]) – List of data dictionaries from a FunctionReport file.

  • columns (list[Column]) – List of SQLAlchemy Columns to retrieve data for.

  • conditionals (List[BinaryExpression]) – List of SQLAlchemy BinaryExpressions to filter with.

  • verbose (bool) – A boolean value to toggle progress print statements.

pdm_utils.pipelines.revise.write_revise_file(data_dicts, output_path, file_format='p_curation', file_name='revise.csv', verbose=False)

Writes a revision csv in the desired file format with necessary changes.

Parameters
  • data_dicts (list[dict]) – List of data dictionaries to convert to curation format.

  • output_path (Path) – Path to a dir for file creation.

  • file_format (str) – Format of the csv to be written.

  • file_name (str) – Name of the file to write curation data to.

  • verbose (bool) – A boolean value to toggle progress print statements.

pdm_utils.pipelines.update_field module

Pipeline to update specific fields in a MySQL database.

pdm_utils.pipelines.update_field.main(unparsed_args)

Runs the complete update pipeline.

pdm_utils.pipelines.update_field.parse_args(unparsed_args_list)

Verify the correct arguments are selected for getting updates.

pdm_utils.pipelines.update_field.update_field(alchemist, update_ticket)

Attempts to update a field using information from an update_ticket.

Parameters
  • alchemist (AlchemyHandler) – A connected and fully built AlchemyHandler object.

  • update_ticket (dict) – Dictionary with instructions to update a field.

Module contents