get_data

Pipeline to gather new data to be imported into a MySQL database.

pdm_utils.pipelines.get_data.check_record_date(record_list, accession_dict)

Check whether the GenBank record is new.
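The following is a minimal sketch of a date-based "is this record new" check, not the pdm_utils implementation. It assumes that "new" means the record's date is later than the date already stored, and that the record date arrives as a GenBank-style string such as "21-JUN-2018". The name `is_newer` and its parameters are hypothetical.

```python
from datetime import datetime

# GenBank flat files write dates as DD-MON-YYYY, e.g. "21-JUN-2018".
GENBANK_DATE_FORMAT = "%d-%b-%Y"


def is_newer(record_date, stored_date):
    """Return True if the record date string is later than the stored datetime.

    Hypothetical helper: the comparison rule is an assumption, not taken
    from the pdm_utils source.
    """
    parsed = datetime.strptime(record_date, GENBANK_DATE_FORMAT)
    return parsed > stored_date


stored = datetime(2018, 1, 1)
newer = is_newer("21-JUN-2018", stored)
older = is_newer("05-MAR-2017", stored)
```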

pdm_utils.pipelines.get_data.compare_data(gnm_pair)

Compare data and create update tickets.

pdm_utils.pipelines.get_data.compute_genbank_tallies(results)

Tally results from GenBank retrieval.
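A tally of this kind can be sketched with `collections.Counter`. This is an illustration only: the `status` key, the status labels, and the `tally_results` name are hypothetical and are not taken from the pdm_utils source.

```python
from collections import Counter


def tally_results(results, key="status"):
    """Count how many result dictionaries carry each value of `key`.

    Hypothetical helper; the key name is an assumption for illustration.
    """
    return Counter(result[key] for result in results)


# Placeholder result dictionaries, one per attempted retrieval.
results = [
    {"phage_id": "PhageA", "status": "retrieved"},
    {"phage_id": "PhageB", "status": "record_not_new"},
    {"phage_id": "PhageC", "status": "retrieved"},
    {"phage_id": "PhageD", "status": "retrieval_failure"},
]
tallies = tally_results(results)
for status, count in sorted(tallies.items()):
    print(f"{status}: {count}")
```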

pdm_utils.pipelines.get_data.convert_tickets_to_dict(list_of_tickets)

Convert list of tickets to list of dictionaries.

pdm_utils.pipelines.get_data.create_accession_sets(genome_dict)

Generate sets of unique and non-unique accessions.

Input is a dictionary of pdm_utils genome objects.
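The split between unique and non-unique accessions can be sketched as follows. This is not the pdm_utils implementation: `FakeGenome` and `split_accessions` are hypothetical stand-ins, the accessions are placeholders, and the sketch assumes each genome object exposes an `accession` attribute.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class FakeGenome:
    """Hypothetical stand-in for a pdm_utils genome object."""
    id: str
    accession: str


def split_accessions(genome_dict):
    """Return (unique, non_unique) sets of accessions found in genome_dict."""
    counts = Counter(gnm.accession for gnm in genome_dict.values())
    unique = {acc for acc, n in counts.items() if n == 1}
    non_unique = {acc for acc, n in counts.items() if n > 1}
    return unique, non_unique


genomes = {
    "PhageA": FakeGenome("PhageA", "ACC001"),
    "PhageB": FakeGenome("PhageB", "ACC002"),
    "PhageC": FakeGenome("PhageC", "ACC002"),  # deliberately duplicated
}
unique, non_unique = split_accessions(genomes)
```

Here `unique` holds the accession used by exactly one genome and `non_unique` holds the one shared by two genomes.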

pdm_utils.pipelines.get_data.create_draft_ticket(name)

Create ImportTicket for draft genome.

pdm_utils.pipelines.get_data.create_genbank_ticket(gnm)

Create ImportTicket for GenBank record.

pdm_utils.pipelines.get_data.create_phagesdb_ticket(phage_id)

Create ImportTicket for PhagesDB genome.

pdm_utils.pipelines.get_data.create_results_dict(gnm, genbank_date, result)

Create a dictionary of data summarizing NCBI retrieval status.

pdm_utils.pipelines.get_data.create_ticket_table(tickets, output_folder)

Save tickets associated with files retrieved from GenBank.

pdm_utils.pipelines.get_data.create_update_ticket(field, value, key_value)

Create update ticket.

pdm_utils.pipelines.get_data.get_accessions_to_retrieve(summary_records, accession_dict)

Review GenBank summary to determine which records are new.

pdm_utils.pipelines.get_data.get_draft_data(output_path, phage_id_set)

Run sub-pipeline to retrieve auto-annotated ‘draft’ genomes.

pdm_utils.pipelines.get_data.get_final_data(output_folder, matched_genomes)

Run sub-pipeline to retrieve ‘final’ genomes from PhagesDB.

pdm_utils.pipelines.get_data.get_genbank_data(output_folder, genome_dict, ncbi_cred_dict={}, genbank_results=False, force=False)

Run sub-pipeline to retrieve genomes from GenBank.

pdm_utils.pipelines.get_data.get_matched_drafts(matched_genomes)

Generate a list of matched ‘draft’ genomes.

pdm_utils.pipelines.get_data.get_update_data(output_folder, matched_genomes)

Run sub-pipeline to retrieve field updates from PhagesDB.

pdm_utils.pipelines.get_data.main(unparsed_args_list)

Run main retrieve_updates pipeline.

pdm_utils.pipelines.get_data.match_genomes(dict1, dict2)

Match MySQL database genome data to PhagesDB genome data.

In both dictionaries, the key is a PhageID and the value is a pdm_utils genome object.
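Matching two dictionaries keyed by PhageID can be sketched with set operations on their keys. The function name `match_by_phage_id` and its return shape are assumptions for illustration, and plain strings stand in for genome objects.

```python
def match_by_phage_id(dict1, dict2):
    """Pair values that share a key and report keys found in only one dict.

    Hypothetical helper; the real return value of match_genomes may differ.
    """
    shared = dict1.keys() & dict2.keys()
    matched = [(dict1[key], dict2[key]) for key in sorted(shared)]
    only_first = sorted(dict1.keys() - dict2.keys())
    only_second = sorted(dict2.keys() - dict1.keys())
    return matched, only_first, only_second


# Placeholder data: strings stand in for genome objects.
mysql_genomes = {"PhageA": "mysql-A", "PhageB": "mysql-B"}
phagesdb_genomes = {"PhageB": "pdb-B", "PhageC": "pdb-C"}

matched, only_mysql, only_phagesdb = match_by_phage_id(
    mysql_genomes, phagesdb_genomes
)
```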

pdm_utils.pipelines.get_data.output_genbank_summary(output_folder, results)

Save summary of GenBank retrieval results to file.

pdm_utils.pipelines.get_data.parse_args(unparsed_args_list)

Verify the correct arguments are selected for getting updates.

pdm_utils.pipelines.get_data.print_genbank_tallies(tallies)

Print results of GenBank retrieval.

pdm_utils.pipelines.get_data.print_match_results(dict)

Print results of genome matching.

pdm_utils.pipelines.get_data.process_failed_retrieval(accession_list, accession_dict)

Create list of dictionaries for records that could not be retrieved.

pdm_utils.pipelines.get_data.retrieve_drafts(output_folder, phage_list)

Retrieve auto-annotated ‘draft’ genomes from PECAAN.

pdm_utils.pipelines.get_data.retrieve_records(accession_dict, ncbi_folder, batch_size=200)

Retrieve GenBank records.
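The `batch_size=200` default suggests that accessions are requested in groups rather than all at once. A minimal sketch of that kind of batching is shown below; `iter_batches` is a hypothetical helper, the accessions are placeholders, and no network request is made.

```python
def iter_batches(items, batch_size=200):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Placeholder accessions, not real GenBank identifiers.
accessions = [f"ACC{i:04d}" for i in range(450)]
batch_sizes = [len(batch) for batch in iter_batches(accessions)]
```

With 450 placeholder accessions and the default size, this gives two full batches and one partial batch.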

pdm_utils.pipelines.get_data.save_and_tickets(record_list, accession_dict, output_folder)

Save flat files retrieved from GenBank and create import tickets.

pdm_utils.pipelines.get_data.save_genbank_file(seqrecord, accession, name, output_folder)

Save retrieved record to file.

pdm_utils.pipelines.get_data.save_pecaan_file(response, name, output_folder)

Save data retrieved from PECAAN.

pdm_utils.pipelines.get_data.save_phagesdb_file(data, gnm, output_folder)

Save file retrieved from PhagesDB.

pdm_utils.pipelines.get_data.set_phagesdb_gnm_date(gnm)

Set the date of a PhagesDB genome object.

pdm_utils.pipelines.get_data.set_phagesdb_gnm_file(gnm)

Set the filename of a PhagesDB genome object.

pdm_utils.pipelines.get_data.sort_by_accession(genome_dict, force=False)

Sort genome objects based on their accession status.

Data are retained only if the genome is set to be automatically updated, it has a valid accession, and that accession is unique.
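The three retention conditions can be sketched as a simple filter. This is an illustration under stated assumptions, not the pdm_utils code: `FakeGenome`, its `auto_update` attribute, and `keep_retrievable` are hypothetical, a "valid" accession is assumed to be any non-empty string, and the `force` parameter is not modelled.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class FakeGenome:
    """Hypothetical stand-in for a pdm_utils genome object."""
    id: str
    accession: str
    auto_update: bool


def keep_retrievable(genome_dict):
    """Return {accession: genome} for genomes that pass all three checks."""
    counts = Counter(g.accession for g in genome_dict.values() if g.accession)
    kept = {}
    for gnm in genome_dict.values():
        if not gnm.auto_update:
            continue  # not set to be automatically updated
        if not gnm.accession:
            continue  # no accession (assumed to mean "not valid")
        if counts[gnm.accession] > 1:
            continue  # accession shared with another genome
        kept[gnm.accession] = gnm
    return kept


genomes = {
    "PhageA": FakeGenome("PhageA", "ACC001", True),
    "PhageB": FakeGenome("PhageB", "ACC002", False),  # auto-update off
    "PhageC": FakeGenome("PhageC", "", True),         # no accession
    "PhageD": FakeGenome("PhageD", "ACC003", True),   # duplicate accession
    "PhageE": FakeGenome("PhageE", "ACC003", True),   # duplicate accession
}
kept = keep_retrievable(genomes)
```

Only the first placeholder genome survives the filter, keyed by its accession.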