pdm_utils.functions package

Submodules

pdm_utils.functions.annotation module

Functions to retrieve phage genome annotation data.

pdm_utils.functions.annotation.get_adjacent_genes(alchemist, gene)
pdm_utils.functions.annotation.get_annotations_from_genes(alchemist, geneids)
pdm_utils.functions.annotation.get_count_adjacent_annotations_to_pham(alchemist, pham, incounts=None)
pdm_utils.functions.annotation.get_count_adjacent_phams_to_pham(alchemist, pham, incounts=None)
pdm_utils.functions.annotation.get_count_annotations_in_genes(alchemist, geneids, incounts=None)
pdm_utils.functions.annotation.get_count_annotations_in_pham(alchemist, pham, incounts=None)
pdm_utils.functions.annotation.get_count_phams_in_genes(alchemist, geneids, incounts=None)
pdm_utils.functions.annotation.get_distinct_adjacent_phams(alchemist, pham)
pdm_utils.functions.annotation.get_distinct_annotations_from_genes(alchemist, geneids)
pdm_utils.functions.annotation.get_distinct_phams_from_genes(alchemist, geneids)
pdm_utils.functions.annotation.get_genes_adjacent_to_pham(alchemist, pham)
pdm_utils.functions.annotation.get_genes_from_pham(alchemist, pham)
pdm_utils.functions.annotation.get_phams_from_genes(alchemist, geneids)
pdm_utils.functions.annotation.get_relative_gene(alchemist, geneid, pos)

pdm_utils.functions.basic module

Misc. base/simple functions. These should not require import of other modules in this package to prevent circular imports.

pdm_utils.functions.basic.ask_yes_no(prompt='', response_attempt=1)

Function to get the user’s yes/no response to a question.

Accepts variations of yes/y, true/t, no/n, false/f, exit/quit/q.

Parameters
  • prompt (str) – the question to ask the user.

  • response_attempt (int) – The number of attempts allowed before the function exits. This prevents the script from getting stuck in a loop.

Returns

The default is False (e.g. user hits Enter without typing anything else), but variations of yes or true responses will return True instead. If the response is ‘exit’ or ‘quit’, the loop is exited and None is returned.

Return type

bool, None

pdm_utils.functions.basic.check_empty(value, lower=True)

Checks if the value represents a null value.

Parameters
  • value (misc.) – Value to be checked against the empty set.

  • lower (bool) – Indicates whether the input value should be lowercased prior to checking.

Returns

Indicates whether the value is present in the empty set.

Return type

bool

pdm_utils.functions.basic.check_value_expected_in_set(value, set1, expect=True)

Check if a value is present within a set and if it is expected.

Parameters
  • value (misc.) – The value to be checked.

  • set1 (set) – The reference set of values.

  • expect (bool) – Indicates if ‘value’ is expected to be present in ‘set1’.

Returns

The result of the evaluation.

Return type

bool

pdm_utils.functions.basic.check_value_in_two_sets(value, set1, set2)

Check if a value is present within two sets.

Parameters
  • value (misc.) – The value to be checked.

  • set1 (set) – The first reference set of values.

  • set2 (set) – The second reference set of values.

Returns

The result of the evaluation, indicating whether the value is present within:

  1. only the ‘first’ set

  2. only the ‘second’ set

  3. ’both’ sets

  4. ’neither’ set

Return type

str
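The four outcomes listed above can be sketched as follows. This is a minimal illustration, not the package's implementation: the name `classify_membership` is hypothetical, and the return strings are taken from the list above.

```python
def classify_membership(value, set1, set2):
    """Report which of two reference sets contain the value."""
    in_first = value in set1
    in_second = value in set2
    if in_first and in_second:
        return "both"
    if in_first:
        return "first"
    if in_second:
        return "second"
    return "neither"


print(classify_membership("A", {"A", "B"}, {"B", "C"}))
```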

pdm_utils.functions.basic.choose_from_list(options)

Iterate through a list of values and choose a value.

Parameters

options (list) – List of options to choose from.

Returns

The user-selected option, or None.

Return type

option or None

pdm_utils.functions.basic.choose_most_common(string, values)

Identify most common occurrence of several values in a string.

Parameters
  • string (str) – String to search.

  • values (list) – List of string characters. The order in the list indicates preference, in the case of a tie.

Returns

Value from values that occurs most.

Return type

str

pdm_utils.functions.basic.clear_screen()

Brings the command line to the top of the screen.

pdm_utils.functions.basic.compare_cluster_subcluster(cluster, subcluster)

Check if a cluster and subcluster designation are compatible.

Parameters
  • cluster (str) – The cluster value to be compared. ‘Singleton’ and ‘UNK’ are lowercased.

  • subcluster (str) – The subcluster value to be compared.

Returns

The result of the evaluation, indicating whether the two values are compatible.

Return type

bool

pdm_utils.functions.basic.compare_sets(set1, set2)

Compute the intersection and differences between two sets.

Parameters
  • set1 (set) – The first input set.

  • set2 (set) – The second input set.

Returns

tuple (set_intersection, set1_diff, set2_diff) WHERE set_intersection(set) is the set of shared values. set1_diff(set) is the set of values unique to the first set. set2_diff(set) is the set of values unique to the second set.

Return type

tuple
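The returned tuple corresponds to three standard set operations. A minimal sketch, assuming plain Python sets as input (the name `compare_sets_sketch` is hypothetical):

```python
def compare_sets_sketch(set1, set2):
    """Return the shared values and the values unique to each set."""
    set_intersection = set1 & set2
    set1_diff = set1 - set2
    set2_diff = set2 - set1
    return (set_intersection, set1_diff, set2_diff)


shared, only_first, only_second = compare_sets_sketch({1, 2, 3}, {3, 4})
print(shared, only_first, only_second)
```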

pdm_utils.functions.basic.convert_empty(input_value, format, upper=False)

Converts common null value formats.

Parameters
  • input_value (str, int, datetime) – Value to be re-formatted.

  • format (str) – Indicates how the value should be edited. Valid format types include:
      ‘empty_string’ = ‘’
      ‘none_string’ = ‘none’
      ‘null_string’ = ‘null’
      ‘none_object’ = None
      ‘na_long’ = ‘not applicable’
      ‘na_short’ = ‘na’
      ‘n/a’ = ‘n/a’
      ‘zero_string’ = ‘0’
      ‘zero_num’ = 0
      ‘empty_datetime_obj’ = datetime object with arbitrary date, ‘1/1/0001’

  • upper (bool) – Indicates whether the output value should be uppercased.

Returns

The re-formatted value as indicated by ‘format’.

Return type

str, int, datetime

pdm_utils.functions.basic.convert_list_to_dict(data_list, key)

Convert list of dictionaries to a dictionary of dictionaries

Parameters
  • data_list (list) – List of dictionaries.

  • key (str) – key in each dictionary to become the returned dictionary key.

Returns

Dictionary of all dictionaries. Returns an empty dictionary if all intended keys are not unique.

Return type

dict
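The uniqueness rule in the return description can be sketched like this. It is an illustration under the stated behavior (an empty dictionary when the intended keys collide), and the name `list_to_dict_sketch` is hypothetical.

```python
def list_to_dict_sketch(data_list, key):
    """Index a list of dictionaries by the value stored under `key`."""
    new_dict = {}
    for element in data_list:
        new_key = element[key]
        if new_key in new_dict:
            # A repeated key means the result would be ambiguous.
            return {}
        new_dict[new_key] = element
    return new_dict


rows = [{"id": "a", "n": 1}, {"id": "b", "n": 2}]
print(list_to_dict_sketch(rows, "id"))
```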

pdm_utils.functions.basic.convert_to_decoded(values)

Converts a list of utf-8 encoded byte values to decoded strings.

Parameters

values (list[bytes]) – Byte values from MySQL queries to be decoded.

Returns

List of utf-8 decoded values.

Return type

list[str]

pdm_utils.functions.basic.convert_to_encoded(values)

Converts a list of strings to utf-8 encoded values.

Parameters

values (list[str]) – Strings for a MySQL query to be encoded.

Returns

List of utf-8 encoded values.

Return type

list[bytes]
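The encode and decode helpers are inverses of each other. A minimal sketch using the built-in `str.encode` and `bytes.decode` methods (the function names here are hypothetical):

```python
def encode_values(values):
    """Encode each string as utf-8 bytes."""
    return [value.encode("utf-8") for value in values]


def decode_values(values):
    """Decode each utf-8 byte value back to a string."""
    return [value.decode("utf-8") for value in values]


encoded = encode_values(["Trixie", "D29"])
print(encoded, decode_values(encoded))
```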

pdm_utils.functions.basic.create_indices(input_list, batch_size)

Create list of start and stop indices to split a list into batches.

Parameters
  • input_list (list) – List from which to generate batch indices.

  • batch_size (int) – Size of each batch.

Returns

List of 2-element tuples (start index, stop index).

Return type

list
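One way to produce such indices is sketched below. It assumes the stop index is exclusive, so that each pair can be used directly as a slice; the documentation above does not state this, and the name `batch_indices` is hypothetical.

```python
def batch_indices(input_list, batch_size):
    """Return (start, stop) pairs covering the list in batches."""
    indices = []
    for start in range(0, len(input_list), batch_size):
        # The final batch may be shorter than batch_size.
        stop = min(start + batch_size, len(input_list))
        indices.append((start, stop))
    return indices


items = list("abcde")
for start, stop in batch_indices(items, 2):
    print(items[start:stop])
```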

pdm_utils.functions.basic.edit_suffix(value, option, suffix='_Draft')

Adds or removes the indicated suffix to an input value.

Parameters
  • value (str) – Value that will be edited.

  • option (str) – Indicates what to do with the value and suffix (‘add’, ‘remove’).

  • suffix (str) – The suffix that will be added or removed.

Returns

The edited value. The suffix is not added if the input value already has the suffix.

Return type

str
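The add and remove behavior can be sketched as follows. This version compares the suffix case-sensitively, which is an assumption; the name `edit_suffix_sketch` is hypothetical.

```python
def edit_suffix_sketch(value, option, suffix="_Draft"):
    """Add or remove a suffix, leaving the value unchanged otherwise."""
    if option == "add":
        # Do not add the suffix a second time.
        if not value.endswith(suffix):
            value = value + suffix
    elif option == "remove":
        if suffix and value.endswith(suffix):
            value = value[:-len(suffix)]
    return value


print(edit_suffix_sketch("Trixie", "add"))
print(edit_suffix_sketch("Trixie_Draft", "remove"))
```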

pdm_utils.functions.basic.expand_path(input_path)

Convert a non-absolute path into an absolute path.

Parameters

input_path (str) – The path to be expanded.

Returns

The expanded path.

Return type

str

pdm_utils.functions.basic.find_expression(expression, list_of_items)

Counts the number of items with matches to a regular expression.

Parameters
  • expression (re) – Regular expression object

  • list_of_items (list) – List of items that will be searched with the regular expression.

Returns

Number of times the regular expression was identified in the list.

Return type

int

pdm_utils.functions.basic.get_user_pwd(user_prompt='Username: ', pwd_prompt='Password: ')

Get username and password.

Parameters
  • user_prompt (str) – Displayed description when prompted for username.

  • pwd_prompt (str) – Displayed description when prompted for password.

Returns

tuple (username, password) WHERE username(str) is the user-supplied username. password(str) is the user-supplied password.

Return type

tuple

pdm_utils.functions.basic.get_values_from_dict_list(list_of_dicts)

Convert a list of dictionaries to a set of the dictionary values.

Parameters

list_of_dicts (list) – List of dictionaries.

Returns

Set of values from all dictionaries in the list.

Return type

set

pdm_utils.functions.basic.get_values_from_tuple_list(list_of_tuples)

Convert a list of tuples to a set of the tuple values.

Parameters

list_of_tuples (list) – List of tuples.

Returns

Set of values from all tuples in the list.

Return type

set

pdm_utils.functions.basic.identify_contents(path_to_folder, kind=None, ignore_set={})

Create a list of filenames and/or folders from an indicated directory.

Parameters
  • path_to_folder (Path) – A valid directory path.

  • kind (str) – (“file”, “dir”), corresponding with paths to be checked as either files or directories.

  • ignore_set (set) – A set of strings representing file or folder names to ignore.

Returns

List of valid contents in the directory.

Return type

list

pdm_utils.functions.basic.identify_nested_items(complete_list)

Identify nested and non-nested two-element tuples in a list.

Parameters

complete_list (list) – List of tuples that will be evaluated.

Returns

tuple (not_nested_set, nested_set) WHERE not_nested_set(set) is a set of non-nested tuples. nested_set(set) is a set of nested tuples.

Return type

tuple

pdm_utils.functions.basic.identify_one_list_duplicates(item_list)

Identify duplicate items within a list.

Parameters

item_list (list) – The input list to be checked.

Returns

The set of non-unique/duplicated items.

Return type

set

pdm_utils.functions.basic.identify_two_list_duplicates(item1_list, item2_list)

Identify duplicate items between two lists.

Parameters
  • item1_list (list) – The first input list to be checked.

  • item2_list (list) – The second input list to be checked.

Returns

The set of non-unique/duplicated items between the two lists (but not duplicate items within each list).

Return type

set

pdm_utils.functions.basic.identify_unique_items(complete_list)

Identify unique and non-unique items in a list.

Parameters

complete_list (list) – List of items that will be evaluated.

Returns

tuple (unique_set, duplicate_set) WHERE unique_set(set) is a set of all unique/non-duplicated items. duplicate_set(set) is a set of non-unique/duplicated items. non-informative/generic data is removed.

Return type

tuple
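Splitting a list into unique and duplicated items can be sketched with `collections.Counter`. This is an illustration only; the name `split_unique_duplicates` is hypothetical.

```python
from collections import Counter


def split_unique_duplicates(complete_list):
    """Return (unique_set, duplicate_set) for the items in a list."""
    counts = Counter(complete_list)
    unique_set = {item for item, count in counts.items() if count == 1}
    duplicate_set = {item for item, count in counts.items() if count > 1}
    return unique_set, duplicate_set


print(split_unique_duplicates(["a", "b", "b", "c", "c", "c"]))
```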

pdm_utils.functions.basic.increment_histogram(data, histogram)

Increments a dictionary histogram based on given data.

Parameters
  • data (list) – Data to be used to index or create new keys in the histogram.

  • histogram (dict) – Dictionary containing keys whose values contain counts.
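A minimal sketch of incrementing a histogram in place, assuming each item in the data adds one to its count and unseen items start at zero (the name `increment_histogram_sketch` is hypothetical):

```python
def increment_histogram_sketch(data, histogram):
    """Add one count to the histogram for every item in data."""
    for item in data:
        histogram[item] = histogram.get(item, 0) + 1


counts = {"A": 1}
increment_histogram_sketch(["A", "B", "A"], counts)
print(counts)
```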

pdm_utils.functions.basic.invert_dictionary(dictionary)

Inverts a dictionary, where the values and keys are swapped.

Parameters

dictionary (dict) – A dictionary to be inverted.

Returns

Returns an inverted dictionary of the given dictionary.

Return type

dict

pdm_utils.functions.basic.is_float(string)

Check if string can be converted to float.

pdm_utils.functions.basic.join_strings(input_list, delimiter=' ')

Join a list of values into a single string, separated by a delimiter.

Parameters
  • input_list (list) – List of values to join.

  • delimiter (str) – Delimiter used between values.

Returns

Concatenated values, excluding all None and ‘’ values.

Return type

str

pdm_utils.functions.basic.lower_case(value)

Return the value lowercased if it is within a specific set of values.

Parameters

value (str) – The value to be checked.

Returns

The lowercased value if it is equivalent to ‘none’, ‘retrieve’, or ‘retain’.

Return type

str

pdm_utils.functions.basic.make_new_dir(output_dir, new_dir, attempt=1, mkdir=True)

Make a new directory.

Checks to verify the new directory name is valid and does not already exist. If it already exists, it attempts to extend the name with an integer suffix.

Parameters
  • output_dir (Path) – Full path to the directory where the new directory will be created.

  • new_dir (Path) – Name of the new directory to be created.

  • attempt (int) – Number of attempts to create the directory.

Returns

If successful, the full path of the created directory. If unsuccessful, None.

Return type

Path, None

pdm_utils.functions.basic.make_new_file(output_dir, new_file, ext, attempt=1)

Make a new file.

Checks to verify the new file name is valid and does not already exist. If it already exists, it attempts to extend the name with an integer suffix.

Parameters
  • output_dir (Path) – Full path to the directory where the new directory will be created.

  • new_file (Path) – Name of the new file to be created.

  • ext (str) – Name of the file extension to be used.

  • attempt (int) – Number of attempts to create the file.

Returns

If successful, the full path of the created file. If unsuccessful, None.

Return type

Path, None

pdm_utils.functions.basic.match_items(list1, list2)

Match values of two lists and return several results.

Parameters
  • list1 (list) – The first input list.

  • list2 (list) – The second input list.

Returns

tuple (matched_unique_items, set1_unmatched_unique_items, set2_unmatched_unique_items, set1_duplicate_items, set2_duplicate_items) WHERE
matched_unique_items(set) is the set of matched unique values.
set1_unmatched_unique_items(set) is the set of unmatched unique values from the first list.
set2_unmatched_unique_items(set) is the set of unmatched unique values from the second list.
set1_duplicate_items(set) is the set of duplicate values from the first list.
set2_duplicate_items(set) is the set of duplicate values from the second list.

Return type

tuple

pdm_utils.functions.basic.merge_set_dicts(dict1, dict2)

Merge two dictionaries of sets.

Parameters
  • dict1 (dict) – First dictionary of sets.

  • dict2 (dict) – Second dictionary of sets.

Returns

Merged dictionary containing all keys from both dictionaries, and for each shared key the value is a set of merged values.

Return type

dict
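The merge rule can be sketched as a union over the keys of both dictionaries. This is a minimal illustration, and the name `merge_set_dicts_sketch` is hypothetical.

```python
def merge_set_dicts_sketch(dict1, dict2):
    """Merge two dictionaries of sets, taking the union for shared keys."""
    merged = {}
    for key in dict1.keys() | dict2.keys():
        merged[key] = set(dict1.get(key, set())) | set(dict2.get(key, set()))
    return merged


print(merge_set_dicts_sketch({"a": {1}, "b": {2}}, {"b": {3}, "c": {4}}))
```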

pdm_utils.functions.basic.parse_flag_file(flag_file)

Parse a file to an evaluation flag dictionary.

Parameters

flag_file (str) – A two-column csv-formatted file in which the first column is the evaluation flag and the second column is ‘True’ or ‘False’.

Returns

A dictionary in which the keys (str) are evaluation flags and the values (bool) indicate the flag setting. Only flags that contain boolean values are returned.

Return type

dict
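Parsing such a two-column file can be sketched with the standard `csv` module. This is an illustration: matching ‘True’ and ‘False’ without regard to case is an assumption, and the function name and flag names are hypothetical.

```python
import csv
import os
import tempfile


def parse_flag_file_sketch(flag_file):
    """Read (flag, 'True'/'False') rows into a dictionary of booleans."""
    flags = {}
    with open(flag_file, newline="") as handle:
        for row in csv.reader(handle):
            if len(row) != 2:
                continue
            name, setting = row[0].strip(), row[1].strip().lower()
            # Rows whose second column is not a boolean are skipped.
            if setting == "true":
                flags[name] = True
            elif setting == "false":
                flags[name] = False
    return flags


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "flags.csv")
    with open(path, "w", newline="") as handle:
        handle.write("check_seq,True\ncheck_id,False\ncheck_other,maybe\n")
    parsed_flags = parse_flag_file_sketch(path)
print(parsed_flags)
```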

pdm_utils.functions.basic.parse_names_from_record_field(description)

Attempts to parse the phage/plasmid/prophage name and host genus from a given string.

Parameters

description (str) – The input string to be parsed.

Returns

name, host_genus

pdm_utils.functions.basic.partition_list(data_list, size)

Chunks list into a list of lists with the given size.

Parameters
  • data_list (list) – List to be split into equal-sized lists.

  • size (int) – Length of the resulting list chunks.

Returns

Returns list of lists with length of the given size.

Return type

list[list]
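Chunking a list by slicing can be sketched as follows. In this sketch the final chunk is shorter when the list length is not a multiple of the size; whether the package behaves the same way is an assumption. The name `partition_list_sketch` is hypothetical.

```python
def partition_list_sketch(data_list, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [data_list[i:i + size] for i in range(0, len(data_list), size)]


print(partition_list_sketch([1, 2, 3, 4, 5], 2))
```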

pdm_utils.functions.basic.prepare_filepath(folder_path, file_name, folder_name=None)

Prepare path to new file.

Parameters
  • folder_path (Path) – Path to the directory to contain the file.

  • file_name (str) – Name of the file.

  • folder_name (Path) – Name of sub-directory to create.

Returns

Path to file in directory.

Return type

Path

pdm_utils.functions.basic.reformat_coordinates(start, stop, current, new)

Converts common coordinate formats.

The type of coordinate formats include:

‘0_half_open’:

0-based half-open intervals, the common format for BAM files and the UCSC Browser database. This format seems to be more efficient when performing genomics computations.

‘1_closed’:

1-based closed intervals, the common format for the MySQL database, the UCSC Browser, the Ensembl genomics database, VCF files, and GFF files. This format seems to be more intuitive and is used for visualization.

The function assumes coordinates reflect the start and stop boundaries (where the start coordinate is smaller than the stop coordinate), instead of transcription start and stop coordinates.

Parameters
  • start (int) – Start coordinate

  • stop (int) – Stop coordinate

  • current (str) – Indicates the indexing format of the input coordinates.

  • new (str) – Indicates the indexing format of the output coordinates.

Returns

The re-formatted start and stop coordinates.

Return type

int
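The conversion between the two formats only shifts the start coordinate: the 0-based half-open interval [0, 10) covers the same bases as the 1-based closed interval [1, 10]. A minimal sketch, in which raising `ValueError` for an unknown format is an assumption and the name `reformat_coordinates_sketch` is hypothetical:

```python
def reformat_coordinates_sketch(start, stop, current, new):
    """Convert boundary coordinates between '0_half_open' and '1_closed'."""
    valid = {"0_half_open", "1_closed"}
    if current not in valid or new not in valid:
        raise ValueError("Unknown coordinate format.")
    if current == new:
        return start, stop
    if current == "0_half_open":
        # 0-based half-open to 1-based closed: the start shifts up by one.
        return start + 1, stop
    # 1-based closed to 0-based half-open: the start shifts down by one.
    return start - 1, stop


print(reformat_coordinates_sketch(0, 10, "0_half_open", "1_closed"))
```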

pdm_utils.functions.basic.reformat_description(raw_description)

Reformat a gene description.

Parameters

raw_description (str) – Input value to be reformatted.

Returns

tuple (description, processed_description) WHERE description(str) is the original value stripped of leading and trailing whitespace. processed_description(str) is the reformatted value, in which non-informative/generic data is removed.

Return type

tuple

pdm_utils.functions.basic.reformat_strand(input_value, format, case=False)

Converts common strand orientation formats.

Parameters
  • input_value (str, int) – Value that will be edited.

  • format (str) – Indicates how the value should be edited. Valid format types include:
      ‘fr_long’ (‘forward’, ‘reverse’)
      ‘fr_short’ (‘f’, ‘r’)
      ‘fr_abbrev1’ (‘for’, ‘rev’)
      ‘fr_abbrev2’ (‘fwd’, ‘rev’)
      ‘tb_long’ (‘top’, ‘bottom’)
      ‘tb_short’ (‘t’, ‘b’)
      ‘wc_long’ (‘watson’, ‘crick’)
      ‘wc_short’ (‘w’, ‘c’)
      ‘operator’ (‘+’, ‘-’)
      ‘numeric’ (1, -1)

  • case (bool) – Indicates whether the output value should be capitalized.

Returns

The re-formatted value as indicated by ‘format’.

Return type

str, int
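The format table above lends itself to a lookup-based conversion. The sketch below is an illustration only: the pairs are copied from the list of valid format types, returning None for an unrecognized value or format is an assumption, and the names `STRAND_FORMATS` and `reformat_strand_sketch` are hypothetical.

```python
# (forward-like, reverse-like) value for each documented format type.
STRAND_FORMATS = {
    "fr_long": ("forward", "reverse"),
    "fr_short": ("f", "r"),
    "fr_abbrev1": ("for", "rev"),
    "fr_abbrev2": ("fwd", "rev"),
    "tb_long": ("top", "bottom"),
    "tb_short": ("t", "b"),
    "wc_long": ("watson", "crick"),
    "wc_short": ("w", "c"),
    "operator": ("+", "-"),
    "numeric": (1, -1),
}


def reformat_strand_sketch(input_value, format, case=False):
    """Convert a strand value from any known format to the requested one."""
    key = input_value.lower() if isinstance(input_value, str) else input_value
    index = None
    for forward, reverse in STRAND_FORMATS.values():
        if key == forward:
            index = 0
        elif key == reverse:
            index = 1
    if index is None or format not in STRAND_FORMATS:
        return None
    new_value = STRAND_FORMATS[format][index]
    if case and isinstance(new_value, str):
        new_value = new_value.capitalize()
    return new_value


print(reformat_strand_sketch("Forward", "operator"))
print(reformat_strand_sketch("-", "numeric"))
```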

pdm_utils.functions.basic.select_option(prompt, valid_response_set)

Select an option from a set of options.

Parameters
  • prompt (str) – Message to display before displaying option.

  • valid_response_set (set) – Set of valid options to choose.

Returns

option

Return type

str, int

pdm_utils.functions.basic.set_path(path, kind=None, expect=True)

Confirm validity of path argument.

Parameters
  • path (Path) – path

  • kind (str) – (“file”, “dir”), corresponding with paths to be checked as either files or directories.

  • expect (bool) – Indicates if the path is expected to be of the indicated kind.

Returns

Absolute path if valid, otherwise sys.exit is called.

Return type

Path

pdm_utils.functions.basic.sort_histogram(histogram, descending=True)

Sorts a dictionary by its values and returns the sorted histogram.

Parameters

histogram (dict) – Dictionary containing keys whose values contain counts.

Returns

An ordered dict from items from the histogram sorted by value.

Return type

OrderedDict

pdm_utils.functions.basic.sort_histogram_keys(histogram, descending=True)

Sorts a dictionary by its values and returns the sorted keys.

Parameters

histogram (dict) – Dictionary containing keys whose values contain counts.

Returns

A list from keys from the histogram sorted by value.

Return type

list
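Both histogram-sorting functions can be sketched with `sorted` and `collections.OrderedDict`. This is a minimal illustration; the function names are hypothetical.

```python
from collections import OrderedDict


def sort_histogram_sketch(histogram, descending=True):
    """Return an OrderedDict of histogram items sorted by count."""
    items = sorted(histogram.items(), key=lambda pair: pair[1],
                   reverse=descending)
    return OrderedDict(items)


def sort_histogram_keys_sketch(histogram, descending=True):
    """Return only the keys, ordered by their counts."""
    return list(sort_histogram_sketch(histogram, descending))


pham_counts = {"a": 1, "b": 3, "c": 2}
print(sort_histogram_sketch(pham_counts))
print(sort_histogram_keys_sketch(pham_counts, descending=False))
```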

pdm_utils.functions.basic.split_string(string)

Split a string based on alphanumeric characters.

Iterates through a string, identifies the first position at which the character is numeric, and splits the string into two strings at this position.

Parameters

string (str) – The value to be split.

Returns

tuple (left, right) WHERE left(str) is the left portion of the input value prior to the first numeric character and only contains alphabetic characters (or will be ‘’). right(str) is the right portion of the input value after the first numeric character and only contains numeric characters (or will be ‘’).

Return type

tuple

pdm_utils.functions.basic.trim_characters(string)

Remove leading and trailing generic characters from a string.

Parameters

string (str) – Value that will be trimmed. Characters that will be removed include: ‘.’, ‘,’, ‘;’, ‘-’, ‘_’.

Returns

Edited value.

Return type

str

pdm_utils.functions.basic.truncate_value(value, length, suffix)

Truncate a string.

Parameters
  • value (str) – String that should be truncated.

  • length (int) – Final length of truncated string.

  • suffix (str) – String that should be appended to truncated string.

Returns

the truncated string

Return type

str

pdm_utils.functions.basic.verify_path(filepath, kind=None)

Verifies that a given path exists.

Parameters
  • filepath (str) – full path to the desired file/directory.

  • kind (str) – (“file”, “dir”), corresponding with paths to be checked as either files or directories.

Returns

True if path is verified, False otherwise.

Return type

bool

pdm_utils.functions.basic.verify_path2(path, kind=None, expect=True)

Verifies that a given path exists.

Parameters
  • path (Path) – path

  • kind (str) – (“file”, “dir”), corresponding with paths to be checked as either files or directories.

  • expect (bool) – Indicates if the path is expected to be of the indicated kind.

Returns

tuple (result, message) WHERE result(bool) indicates if the expectation was satisfied. message(str) is a description of the result.

Return type

tuple

pdm_utils.functions.cartography module

pdm_utils.functions.cartography.get_map(mapper, table)

Get SQLAlchemy ORM map object.

Parameters
  • mapper (DeclarativeMeta) – Connected and prepared SQLAlchemy automap base object.

  • table (str) – Case-insensitive name of the table to retrieve an ORM map for.

Returns

SQLAlchemy mapped object.

Return type

DeclarativeMeta

pdm_utils.functions.cartography.map_cds(metadata)
pdm_utils.functions.cartography.map_genome(metadata)

pdm_utils.functions.configfile module

Configuration file definition and parsing.

pdm_utils.functions.configfile.build_complete_config(file)

Build a complete config object by merging user-supplied and default config.

pdm_utils.functions.configfile.create_empty_config_file(dir, file, null_value)

Create an empty config file with all available settings.

pdm_utils.functions.configfile.default_parser(null_value)

Constructs complete config with empty values.

pdm_utils.functions.configfile.default_sections_keys()
pdm_utils.functions.configfile.parse_config(file, parser=None)

Get parameters from config file.

pdm_utils.functions.configfile.setup_section(keys, value)
pdm_utils.functions.configfile.write_config(parser, filepath)

Write a ConfigParser to file.
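The general pattern of building a default config and overlaying user-supplied settings can be sketched with the standard `configparser` module. Everything named here is hypothetical: the section and key names, the function names, and the use of a string in place of a config file are assumptions for illustration, not the package's actual settings.

```python
import configparser

# Hypothetical sections and keys, for illustration only.
SECTIONS_KEYS = {"mysql": ["user", "password"], "ncbi": ["email"]}


def default_parser_sketch(null_value):
    """Build a parser holding every known setting with an empty value."""
    parser = configparser.ConfigParser()
    for section, keys in SECTIONS_KEYS.items():
        parser[section] = {key: null_value for key in keys}
    return parser


def build_complete_config_sketch(user_text, null_value=""):
    """Overlay user-supplied settings on top of the defaults."""
    parser = default_parser_sketch(null_value)
    parser.read_string(user_text)
    return parser


config = build_complete_config_sketch("[mysql]\nuser = alice\n")
print(config["mysql"]["user"], repr(config["mysql"]["password"]))
```

Settings the user leaves out keep the null value, so downstream code can rely on every section and key being present.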

pdm_utils.functions.eval_modes module

Evaluation mode functions and dictionaries.

pdm_utils.functions.eval_modes.get_eval_flag_dict(eval_mode)

Get a dictionary of evaluation flags.

Parameters

eval_mode (str) – Valid evaluation mode (base, draft, final, auto, misc, custom)

Returns

Dictionary of boolean values.

Return type

dict

pdm_utils.functions.fileio module

pdm_utils.functions.fileio.export_data_dict(data_dicts, file_path, headers, include_headers=False)

Save a dictionary of data to file using specified column headers.

Ensures the output file contains a specified number of columns, and it ensures the column headers are exported as well.

Parameters
  • data_dicts (list) – list of elements, where each element is a dictionary.

  • file_path (Path) – Path to file to export data.

  • headers (list) – List of strings to define the column order in the file. If include_headers is selected, the first row of the file will contain each string.

  • include_headers (bool) – Indicates whether the file should contain a row of column names derived from the headers parameter.

pdm_utils.functions.fileio.parse_feature_table(filehandle)

Takes a (five-column) feature table(s) file handle and parses the data.

Parameters

filehandle – Handle for a five-column formatted feature table file:

Returns

Returns a feature table file parser generator.

Return type

FeatureTableFileParser

pdm_utils.functions.fileio.read_feature_table(filehandle)

Reads a (five-column) feature table and parses the data into a seqrecord.

Parameters

filepath (Path) – Path to the five-column formatted feature table file.

Returns

Returns a Biopython SeqRecord object with the table data.

Return type

SeqRecord

pdm_utils.functions.fileio.reintroduce_fasta_duplicates(ts_to_gs, filepath)

Reads a fasta file and reintroduces (rewrites) duplicate sequences guided by an ungapped translation to sequence-id map.

Parameters
  • filepath (pathlib.Path) – Path to fasta-formatted multiple sequence file

  • ts_to_gs (dict) – Dictionary mapping unique translations to sequence-ids

pdm_utils.functions.fileio.retrieve_data_dict(filepath)

Open file and retrieve a dictionary of data.

Parameters

filepath (Path) – Path to file containing data and column names.

Returns

A list of elements, where each element is a dictionary representing one row of data. Each key is a column name and each value is the data stored in that field.

Return type

list
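The list-of-dictionaries layout used by export_data_dict and retrieve_data_dict maps naturally onto `csv.DictWriter` and `csv.DictReader`. The sketch below is a round trip under assumptions: a comma delimiter, a header row on read, hypothetical function names, and made-up example rows.

```python
import csv
import os
import tempfile


def export_rows_sketch(data_dicts, file_path, headers, include_headers=False):
    """Write each dictionary as one row, with columns ordered by headers."""
    with open(file_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=headers,
                                extrasaction="ignore")
        if include_headers:
            writer.writeheader()
        writer.writerows(data_dicts)


def retrieve_rows_sketch(file_path):
    """Read a file with a header row into a list of row dictionaries."""
    with open(file_path, newline="") as handle:
        return [dict(row) for row in csv.DictReader(handle)]


# Example rows with placeholder values.
rows = [{"id": "genome_1", "group": "X"}, {"id": "genome_2", "group": "Y"}]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "genomes.csv")
    export_rows_sketch(rows, path, ["id", "group"], include_headers=True)
    round_trip = retrieve_rows_sketch(path)
print(round_trip)
```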

pdm_utils.functions.fileio.write_database(alchemist, version, export_path, db_name=None)

Output .sql file from the selected database.

Parameters
  • alchemist (AlchemyHandler) – A connected and fully built AlchemyHandler object.

  • version (int) – Database version information.

  • export_path (Path) – Path to a valid dir for file creation.

pdm_utils.functions.fileio.write_fasta(ids_seqs, infile_path, name=None)

Writes the input genes to the indicated file in FASTA multiple sequence format (unaligned).

Parameters
  • ids_seqs (dict) – The ids and sequences to be written to file.

  • infile_path (Path, str) – The path of the file to write the genes to.

pdm_utils.functions.fileio.write_feature_table(seqrecord_list, export_path, verbose=False)

Outputs files as five_column tab-delimited text files.

Parameters
  • seqrecord_list (list[SeqRecord]) – List of populated SeqRecords.

  • export_path (Path) – Path to a dir for file creation.

  • verbose (bool) – A boolean value to toggle progress print statements.

pdm_utils.functions.fileio.write_seqrecord(seqrecord, file_path, file_format)
pdm_utils.functions.fileio.write_seqrecords(seqrecord_list, file_format, export_path, export_name=None, concatenate=False, threads=1, verbose=False)

Outputs files with a particular format from a SeqRecord list.

Parameters
  • seqrecord_list (list[SeqRecord]) – List of populated SeqRecords.

  • file_format (str) – Biopython supported file type.

  • export_path (Path) – Path to a dir for file creation.

  • concatenate (bool) – A boolean to toggle concatenation of SeqRecords.

  • verbose (bool) – A boolean value to toggle progress print statements.

pdm_utils.functions.flat_files module

Functions to interact with, use, and parse genomic data from GenBank-formatted flat files.

pdm_utils.functions.flat_files.cds_to_seqrecord(cds, parent_genome, gene_domains=[], desc_type='gb')

Creates a SeqRecord object from a Cds and its parent Genome.

Parameters
  • cds (Cds) – A populated Cds object.

  • parent_genome (Genome) – Populated parent Genome object of the Cds object.

  • gene_domains (list) – List of domain objects populated with column attributes.

  • desc_type (str) – Intended format of the CDS SeqRecord description.

Returns

Filled Biopython SeqRecord object.

Return type

SeqRecord

pdm_utils.functions.flat_files.create_fasta_seqrecord(header, sequence_string)

Create a fasta-formatted Biopython SeqRecord object.

Parameters
  • header (str) – Description of the sequence.

  • sequence_string (str) – Nucleotide sequence.

Returns

Biopython SeqRecord containing the nucleotide sequence.

Return type

SeqRecord

pdm_utils.functions.flat_files.create_seqfeature_dictionary(seqfeature_list)

Create a dictionary of Biopython SeqFeature objects based on their type.

From a list of all Biopython SeqFeatures derived from a GenBank-formatted flat file, create a dictionary of SeqFeatures based on their ‘type’ attribute.

Parameters
  • seqfeature_list (list) – List of Biopython SeqFeatures

  • genome_id (str) – An identifier for the genome in which the seqfeature is defined.

Returns

A dictionary of Biopython SeqFeatures: Key: SeqFeature type (source, tRNA, CDS, other) Value: SeqFeature

Return type

dict

pdm_utils.functions.flat_files.format_cds_seqrecord_CDS_feature(cds_feature, cds, parent_genome)
pdm_utils.functions.flat_files.genome_to_seqrecord(phage_genome)

Creates a SeqRecord object from a pdm_utils Genome object.

Parameters

phage_genome (Genome) – A pdm_utils Genome object.

Returns

A BioPython SeqRecord object

Return type

SeqRecord

pdm_utils.functions.flat_files.get_cds_seqrecord_annotations(cds, parent_genome)

Function that creates a Cds SeqRecord annotations attribute dict.

Parameters
  • cds (Cds) – A populated Cds object.

  • parent_genome (Genome) – Populated parent Genome object of the Cds object.

Returns

Formatted SeqRecord annotations dictionary.

Return type

dict{str}

pdm_utils.functions.flat_files.get_cds_seqrecord_annotations_comments(cds)

Function that creates a Cds SeqRecord comments attribute tuple.

Parameters

cds

pdm_utils.functions.flat_files.get_cds_seqrecord_regions(gene_domains, cds)
pdm_utils.functions.flat_files.get_genome_seqrecord_annotations(phage_genome)

Helper function that uses Genome data to populate the annotations SeqRecord attribute

Parameters

phage_genome (genome) – Input a Genome object.

Returns

annotations(dictionary) is a dictionary with the formatting of BioPython’s SeqRecord annotations attribute

pdm_utils.functions.flat_files.get_genome_seqrecord_annotations_comments(phage_genome)

Helper function that uses Genome data to populate the comment annotation attribute

Parameters

phage_genome (genome) – Input a Genome object.

Returns

A tuple of the cluster_comment, auto_generated_comment, annotation_status_comment, and qc_and_retrieval values, with the formatting of BioPython’s SeqRecord annotations comment attribute.

pdm_utils.functions.flat_files.get_genome_seqrecord_description(phage_genome)

Helper function to construct a description SeqRecord attribute.

Parameters

phage_genome (genome) – Input a Genome object.

Returns

description is a formatted string parsed from genome data

pdm_utils.functions.flat_files.get_genome_seqrecord_features(phage_genome)

Helper function that uses Genome data to populate the features SeqRecord attribute.

Parameters

phage_genome (genome) – Input a Genome object.

Returns

features is a list of SeqFeature objects parsed from cds objects

pdm_utils.functions.flat_files.parse_cds_seqfeature(seqfeature)

Parse data from a Biopython CDS SeqFeature object into a Cds object.

Parameters
  • seqfeature (SeqFeature) – Biopython SeqFeature

  • genome_id (str) – An identifier for the genome in which the seqfeature is defined.

Returns

A pdm_utils Cds object

Return type

Cds

pdm_utils.functions.flat_files.parse_coordinates(seqfeature)

Parse the boundary coordinates from a GenBank-formatted flat file.

The function takes a Biopython SeqFeature object containing data that was parsed from the feature in the flat file. Parsing these coordinates can be tricky. There can be more than one set of coordinates if it is a compound location. Only features with 1 or 2 open reading frames (parts) are correctly parsed. Also, the boundaries may not be precise; instead they may be open or fuzzy. Non-precise coordinates are converted to ‘-1’. If the strand is undefined, the coordinates are converted to ‘-1’ and parts is set to ‘0’. If an incorrect data type is provided, coordinates are set to ‘-1’ and parts is set to ‘0’.

Parameters

seqfeature (SeqFeature) – Biopython SeqFeature

Returns

tuple (start, stop, parts) WHERE start(int) is the first coordinate, regardless of strand. stop(int) is the second coordinate, regardless of strand. parts(int) is the number of open reading frames that define the feature.

pdm_utils.functions.flat_files.parse_genome_data(seqrecord, filepath=PosixPath('.'), translation_table=11, genome_id_field='_organism_name', gnm_type='', host_genus_field='_organism_host_genus')

Parse data from a Biopython SeqRecord object into a Genome object.

All Source, CDS, tRNA, and tmRNA features are parsed into their associated Source, Cds, Trna, and Tmrna objects.

Parameters
  • seqrecord (SeqRecord) – A Biopython SeqRecord object.

  • filepath (Path) – A filename associated with the returned Genome object.

  • translation_table (int) – The applicable translation table for the genome’s CDS features.

  • genome_id_field (str) – The SeqRecord attribute in which the unique genome identifier/name is stored.

  • host_genus_field (str) – The SeqRecord attribute in which the unique host genus identifier/name is stored.

  • gnm_type (str) – Identifier for the type of genome.

Returns

A pdm_utils Genome object.

Return type

Genome

pdm_utils.functions.flat_files.parse_source_seqfeature(seqfeature)

Parses a Biopython Source SeqFeature.

Parameters
  • seqfeature (SeqFeature) – Biopython SeqFeature

  • genome_id (str) – An identifier for the genome in which the seqfeature is defined.

Returns

A pdm_utils Source object

Return type

Source

pdm_utils.functions.flat_files.parse_tmrna_seqfeature(seqfeature)

Parses data from a Biopython tmRNA SeqFeature object into a Tmrna object.

Parameters

seqfeature (SeqFeature) – Biopython SeqFeature

Returns

A pdm_utils Tmrna object

Return type

Tmrna

pdm_utils.functions.flat_files.parse_trna_seqfeature(seqfeature)

Parse data from a Biopython tRNA SeqFeature object into a Trna object.

Parameters

seqfeature (SeqFeature) – Biopython SeqFeature

Returns

A pdm_utils Trna object

Return type

Trna

pdm_utils.functions.flat_files.retrieve_genome_data(filepath)

Retrieve data from a GenBank-formatted flat file.

Parameters

filepath (Path) – Path to GenBank-formatted flat file that will be parsed using Biopython.

Returns

If there is only one record, a Biopython SeqRecord of parsed data. If the file cannot be parsed, or if there are multiple records, None value is returned.

Return type

SeqRecord

pdm_utils.functions.flat_files.sort_seqrecord_features(seqrecord)

Function that sorts and processes the seqfeature objects of a seqrecord.

Parameters

seqrecord (SeqRecord) – Phage genome Biopython seqrecord object

pdm_utils.functions.mysqldb module

Functions to interact with MySQL.

pdm_utils.functions.mysqldb.change_version(engine, amount=1)

Change the database version number.

Parameters
  • engine (Engine) – SQLAlchemy Engine object able to connect to a MySQL database.

  • amount (int) – Amount to increment/decrement version number.

pdm_utils.functions.mysqldb.check_schema_compatibility(engine, pipeline, code_version=None)

Confirm database schema is compatible with code.

If schema version is not compatible, sys.exit is called.

Parameters
  • engine (Engine) – SQLAlchemy Engine object able to connect to a MySQL database.

  • pipeline (str) – Description of the pipeline checking compatibility.

  • code_version (int) – Schema version on which the pipeline operates. If no schema version is provided, the package-wide schema version value is used.

pdm_utils.functions.mysqldb.create_delete(table, field, data)

Create MySQL DELETE statement.

“DELETE FROM <table> WHERE <field> = ‘<data>’”

Parameters
  • table (str) – The database table from which rows are deleted.

  • field (str) – The column upon which the statement is conditioned.

  • data (str) – The value of ‘field’ upon which the statement is conditioned.

Returns

A MySQL DELETE statement.

Return type

str
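
The template above can be illustrated with a minimal, stand-alone sketch. The function name delete_statement_sketch and the trailing semicolon are assumptions made for illustration; this is not the pdm_utils implementation.

```python
def delete_statement_sketch(table: str, field: str, data: str) -> str:
    """Fill the DELETE template with a table, a column, and a value."""
    # Hypothetical stand-in for illustration only. The value is placed
    # directly inside single quotes, so it is only suitable for trusted input.
    return f"DELETE FROM {table} WHERE {field} = '{data}';"


statement = delete_statement_sketch("phage", "PhageID", "Trixie")
print(statement)
```

Because the value is interpolated rather than bound as a parameter, a statement built this way should only be assembled from values that are already validated.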

pdm_utils.functions.mysqldb.create_gene_table_insert(cds_ftr)

Create a MySQL gene table INSERT statement.

Parameters

cds_ftr (Cds) – A pdm_utils Cds object.

Returns

A MySQL statement to INSERT a new row in the ‘gene’ table with data for several fields.

Return type

str

pdm_utils.functions.mysqldb.create_genome_statements(gnm, tkt_type='')

Create list of MySQL statements based on the ticket type.

Parameters
  • gnm (Genome) – A pdm_utils Genome object.

  • tkt_type (str) – ‘add’ or ‘replace’.

Returns

List of MySQL statements to INSERT all data from a genome into the database (DELETE FROM genome, INSERT INTO phage, INSERT INTO gene, …).

Return type

list

pdm_utils.functions.mysqldb.create_phage_table_insert(gnm)

Create a MySQL phage table INSERT statement.

Parameters

gnm (Genome) – A pdm_utils Genome object.

Returns

A MySQL statement to INSERT a new row in the ‘phage’ table with data for several fields.

Return type

str

pdm_utils.functions.mysqldb.create_seq_set(engine)

Create set of genome sequences currently in a MySQL database.

Parameters

engine (Engine) – SQLAlchemy Engine object able to connect to a MySQL database.

Returns

A set of unique values from phage.Sequence.

Return type

set

pdm_utils.functions.mysqldb.create_tmrna_table_insert(tmrna_ftr)
Parameters

tmrna_ftr

Returns

pdm_utils.functions.mysqldb.create_trna_table_insert(trna_ftr)

Create a MySQL trna table INSERT statement.

Parameters

trna_ftr (Trna) – A pdm_utils Trna object.

Returns

A MySQL statement to INSERT a new row in the ‘trna’ table with all of trna_ftr’s relevant data.

Return type

str

pdm_utils.functions.mysqldb.create_update(table, field2, value2, field1, value1)

Create MySQL UPDATE statement.

“UPDATE <table> SET <field2> = ‘<value2>’ WHERE <field1> = ‘<value1>’”

When the new value to be added is ‘singleton’ (e.g. for Cluster fields), or an empty value (e.g. None, “none”, etc.), the new value is set to NULL.

Parameters
  • table (str) – The database table to update.

  • field1 (str) – The column upon which the statement is conditioned.

  • value1 (str) – The value of ‘field1’ upon which the statement is conditioned.

  • field2 (str) – The column that will be updated.

  • value2 (str) – The value that will be inserted into ‘field2’.

Returns

A MySQL UPDATE statement.

Return type

str
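
The NULL handling described above might look roughly like the following sketch. The name update_statement_sketch and the contents of EMPTY_VALUES are assumptions; the set of values that pdm_utils treats as empty may differ.

```python
# Assumption: these strings count as "empty"; the real set may differ.
EMPTY_VALUES = {"", "none", "null", "singleton"}


def update_statement_sketch(table, field2, value2, field1, value1):
    """Build an UPDATE statement, writing NULL for empty or 'singleton' values."""
    if value2 is None or str(value2).strip().lower() in EMPTY_VALUES:
        new_value = "NULL"           # unquoted SQL NULL
    else:
        new_value = f"'{value2}'"    # quoted literal
    return (f"UPDATE {table} SET {field2} = {new_value} "
            f"WHERE {field1} = '{value1}';")


print(update_statement_sketch("phage", "Cluster", "A", "PhageID", "Trixie"))
print(update_statement_sketch("phage", "Cluster", "Singleton", "PhageID", "Trixie"))
```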

pdm_utils.functions.mysqldb.execute_transaction(engine, statement_list=[])

Execute list of MySQL statements within a single defined transaction.

Parameters
  • engine (Engine) – SQLAlchemy Engine object able to connect to a MySQL database.

  • statement_list (list) – A list of any number of MySQL statements, none of which is expected to return data.

Returns

tuple (result, message) WHERE result(int) is a status code (0 means no problems, 1 means problems). message(str) is a description of the result.

Return type

tuple
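
The all-or-nothing behaviour and the (result, message) return value can be sketched with the standard-library sqlite3 module. This is only an analogy: sqlite3 stands in for MySQL and SQLAlchemy, and run_transaction_sketch is a hypothetical name.

```python
import sqlite3


def run_transaction_sketch(connection, statements):
    """Run all statements in one transaction; return (result, message)."""
    try:
        # 'with connection' commits on success and rolls back on an exception.
        with connection:
            for statement in statements:
                connection.execute(statement)
    except sqlite3.Error as error:
        return 1, f"Transaction rolled back: {error}"
    return 0, "Transaction committed."


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE phage (PhageID TEXT PRIMARY KEY)")

ok = run_transaction_sketch(conn, ["INSERT INTO phage VALUES ('Trixie')"])
# The second statement violates the primary key, so the first is undone too.
bad = run_transaction_sketch(conn, ["INSERT INTO phage VALUES ('L5')",
                                    "INSERT INTO phage VALUES ('Trixie')"])
rows = conn.execute("SELECT PhageID FROM phage").fetchall()
print(ok, bad, rows)
```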

pdm_utils.functions.mysqldb.get_schema_version(engine)

Identify the schema version of the database_versions_list.

Schema version data has not been persisted in every schema version, so if schema version data is not found, it is deduced from other parts of the schema.

Parameters

engine (Engine) – SQLAlchemy Engine object able to connect to a MySQL database.

Returns

The version of the pdm_utils database schema.

Return type

int

pdm_utils.functions.mysqldb.parse_feature_data(engine, ftr_type, column=None, phage_id_list=None, query=None)

Returns Cds objects containing data parsed from a MySQL database.

Parameters
  • engine (Engine) – This parameter is passed directly to the ‘retrieve_data’ function.

  • query (str) – This parameter is passed directly to the ‘retrieve_data’ function.

  • ftr_type (str) – Indicates the type of features retrieved.

  • column (str) – This parameter is passed directly to the ‘retrieve_data’ function.

  • phage_id_list (list) – This parameter is passed directly to the ‘retrieve_data’ function.

Returns

A list of pdm_utils Cds objects.

Return type

list

pdm_utils.functions.mysqldb.parse_gene_table_data(data_dict, trans_table=11)

Parse a MySQL database dictionary to create a Cds object.

Parameters
  • data_dict (dict) – Dictionary of data retrieved from the gene table.

  • trans_table (int) – The translation table that can be used to translate CDS features.

Returns

A pdm_utils Cds object.

Return type

Cds

pdm_utils.functions.mysqldb.parse_genome_data(engine, phage_id_list=None, phage_query=None, gene_query=None, trna_query=None, tmrna_query=None, gnm_type='')

Returns a list of Genome objects containing data parsed from a MySQL database.

Parameters
  • engine (Engine) – This parameter is passed directly to the ‘retrieve_data’ function.

  • phage_query (str) – This parameter is passed directly to the ‘retrieve_data’ function to retrieve data from the phage table.

  • gene_query (str) – This parameter is passed directly to the ‘parse_feature_data’ function to retrieve data from the gene table. If not None, pdm_utils Cds objects for all of the phage’s CDS features in the gene table will be constructed and added to the Genome object.

  • trna_query (str) – This parameter is passed directly to the ‘parse_feature_data’ function to retrieve data from the trna table. If not None, pdm_utils Trna objects for all of the phage’s tRNA features in the trna table will be constructed and added to the Genome object.

  • tmrna_query (str) – This parameter is passed directly to the ‘parse_feature_data’ function to retrieve data from the tmrna table. If not None, pdm_utils Tmrna objects for all of the phage’s tmRNA features in the tmrna table will be constructed and added to the Genome object.

  • phage_id_list (list) – This parameter is passed directly to the ‘retrieve_data’ function. If there is at least one valid PhageID, a pdm_utils Genome object will be constructed only for each listed phage. If None, or an empty list, Genome objects for all phages in the database will be constructed.

  • gnm_type (str) – Identifier for the type of genome.

Returns

A list of pdm_utils Genome objects.

Return type

list

pdm_utils.functions.mysqldb.parse_phage_table_data(data_dict, trans_table=11, gnm_type='')

Parse a MySQL database dictionary to create a Genome object.

Parameters
  • data_dict (dict) – Dictionary of data retrieved from the phage table.

  • trans_table (int) – The translation table that can be used to translate CDS features.

  • gnm_type (str) – Identifier for the type of genome.

Returns

A pdm_utils Genome object.

Return type

Genome

pdm_utils.functions.mysqldb.parse_tmrna_table_data(data_dict)

Parse a MySQL database dictionary to create a Tmrna object.

Parameters

data_dict (dict) – Dictionary of data retrieved from the tmrna table.

Returns

A pdm_utils Tmrna object.

Return type

Tmrna

pdm_utils.functions.mysqldb.parse_trna_table_data(data_dict)

Parse a MySQL database dictionary to create a Trna object.

Parameters

data_dict (dict) – Dictionary of data retrieved from the trna table.

Returns

A pdm_utils Trna object.

Return type

Trna

pdm_utils.functions.mysqldb_basic module

Basic functions to interact with MySQL and manage databases.

pdm_utils.functions.mysqldb_basic.convert_for_sql(value, check_set={}, single=True)

Convert a value for inserting into MySQL.

Parameters
  • value (misc) – Value that should be checked for conversion.

  • check_set (set) – Set of values to check against.

  • single (bool) – Indicates whether single quotes should be used.

Returns

Returns either “NULL” or the value encapsulated in quotes (“‘value’” or ‘“value”’)

Return type

str
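
A minimal sketch of the conversion follows. It assumes that membership in check_set is what triggers “NULL”; the name convert_for_sql_sketch is hypothetical and the real function may apply additional rules.

```python
def convert_for_sql_sketch(value, check_set=frozenset(), single=True):
    """Return 'NULL' for flagged values, else the value wrapped in quotes."""
    # Assumption: a value found in check_set is written as SQL NULL.
    if value in check_set:
        return "NULL"
    quote = "'" if single else '"'
    return f"{quote}{value}{quote}"


print(convert_for_sql_sketch("A1"))
print(convert_for_sql_sketch("A1", single=False))
print(convert_for_sql_sketch(None, check_set={None, ""}))
```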

pdm_utils.functions.mysqldb_basic.copy_db(engine, new_database)

Copies a database.

Parameters
  • engine (Engine) – SQLAlchemy Engine object able to connect to a MySQL database, which contains the name of the database that will be copied into the new database.

  • new_database (str) – Name of the new copied database.

Returns

Indicates if copy was successful (0) or failed (1).

Return type

int

pdm_utils.functions.mysqldb_basic.create_db(engine, database)

Create a new, empty database.

Parameters
  • engine (Engine) – SQLAlchemy Engine object able to connect to a MySQL database.

  • database (str) – Name of the database to create.

Returns

Indicates if create was successful (0) or failed (1).

Return type

int

pdm_utils.functions.mysqldb_basic.db_exists(engine, database)

Check whether a local MySQL database with the given name exists.

Parameters
  • engine (Engine) – SQLAlchemy Engine object able to connect to a MySQL database.

  • database (str) – Name of the database whose existence is checked.

Returns

Returns whether the database exists

Return type

bool

pdm_utils.functions.mysqldb_basic.drop_create_db(engine, database)

Drops the database and creates a new, empty database with the same name.

Parameters
  • engine (Engine) – SQLAlchemy Engine object able to connect to a MySQL database.

  • database (str) – Name of the database to drop and create.

Returns

Indicates if drop/create was successful (0) or failed (1).

Return type

int

pdm_utils.functions.mysqldb_basic.drop_db(engine, database)

Delete a database.

Parameters
  • engine (Engine) – SQLAlchemy Engine object able to connect to a MySQL database.

  • database (str) – Name of the database to drop.

Returns

Indicates if drop was successful (0) or failed (1).

Return type

int

pdm_utils.functions.mysqldb_basic.first(engine, executable, return_dict=True)

Execute a query and get the first row of data.

Parameters
  • engine (Engine) – SQLAlchemy Engine object able to connect to a MySQL database.

  • executable (str) – Input an executable MySQL query.

  • return_dict (bool) – Toggle whether execute returns dict or tuple.

Returns

Results from execution of given MySQL query.

Return type

dict or tuple

pdm_utils.functions.mysqldb_basic.get_columns(engine, database, table_name)

Retrieve column names from a table.

Parameters
  • engine (Engine) – SQLAlchemy Engine object able to connect to a MySQL database.

  • database (str) – Name of the database to query.

  • table_name (str) – Name of the table to query.

Returns

Set of column names.

Return type

set

pdm_utils.functions.mysqldb_basic.get_distinct(engine, table, column, null=None)

Get set of distinct values currently in a MySQL database.

Parameters
  • engine (Engine) – SQLAlchemy Engine object able to connect to a MySQL database.

  • table (str) – A valid table in the database.

  • column (str) – A valid column in the table.

  • null (misc) – Replacement value for NULL data.

Returns

A set of distinct values from the database.

Return type

set

pdm_utils.functions.mysqldb_basic.get_first_row_data(engine, table)

Retrieves data from the first row of a table.

Parameters

engine (Engine) – SQLAlchemy Engine object able to connect to a MySQL database.

Returns

Dictionary where key = column name.

Return type

dict

pdm_utils.functions.mysqldb_basic.get_mysql_dbs(engine)

Retrieve database names from MySQL.

Parameters

engine (Engine) – SQLAlchemy Engine object able to connect to a MySQL database.

Returns

Set of database names.

Return type

set

pdm_utils.functions.mysqldb_basic.get_table_count(engine, table)

Get the current number of genomes in the database.

Parameters

engine (Engine) – SQLAlchemy Engine object able to connect to a MySQL database.

Returns

Number of rows from the phage table.

Return type

int

pdm_utils.functions.mysqldb_basic.get_tables(engine, database)

Retrieve table names from the database.

Parameters

engine (Engine) – SQLAlchemy Engine object able to connect to a MySQL database.

Returns

Set of table names.

Return type

set

pdm_utils.functions.mysqldb_basic.install_db(engine, schema_filepath)

Install a MySQL file into the indicated database.

Parameters
  • engine (Engine) – SQLAlchemy Engine object able to connect to a MySQL database.

  • schema_filepath (Path) – Path to the MySQL database file.

Returns

Indicates if install was successful (0) or failed (1).

Return type

int

pdm_utils.functions.mysqldb_basic.mysql_login_command(username, password, database)

Construct list of strings representing a mysql command.

pdm_utils.functions.mysqldb_basic.mysqldump_command(username, password, database)

Construct list of strings representing a mysqldump command.

pdm_utils.functions.mysqldb_basic.pipe_commands(command1, command2)

Pipe one command into the other.

pdm_utils.functions.mysqldb_basic.query_dict_list(engine, query)

Get the results of a MySQL query as a list of dictionaries.

Parameters
  • engine (Engine) – SQLAlchemy Engine object able to connect to a MySQL database.

  • query (str) – MySQL query statement.

Returns

List of dictionaries, where each dictionary represents a row of data.

Return type

list

pdm_utils.functions.mysqldb_basic.query_set(engine, query)

Retrieve set of data from MySQL query.

Parameters
  • engine (Engine) – SQLAlchemy Engine object able to connect to a MySQL database.

  • query (str) – MySQL query statement.

Returns

Set of queried data.

Return type

set

pdm_utils.functions.mysqldb_basic.retrieve_data(engine, column=None, query=None, id_list=None)

Retrieve genome data from a MySQL database for a single genome.

The query is modified to include one or more values.

Parameters
  • engine (Engine) – SQLAlchemy Engine object able to connect to a MySQL database.

  • query (str) – A MySQL query that selects valid, specific columns from a valid table without conditioning on a specific column (e.g. ‘SELECT Column1, Column2 FROM table1’).

  • column (str) – A valid column in the table upon which the query can be conditioned.

  • id_list (list) – A list of valid values upon which the query can be conditioned. In conjunction with the ‘column’ parameter, the ‘query’ is modified (e.g. “WHERE Column1 IN (‘Value1’, ‘Value2’)”).

Returns

A list of items, where each item is a dictionary of SQL data for each PhageID.

Return type

list
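
How ‘query’, ‘column’ and ‘id_list’ combine can be sketched as plain string assembly. The helper add_in_clause_sketch is hypothetical, and the exact statement produced by pdm_utils (spacing, terminator) is an assumption.

```python
def add_in_clause_sketch(query, column=None, id_list=None):
    """Append a WHERE ... IN (...) condition when a column and values are given."""
    if column is None or not id_list:
        return f"{query};"                      # unconditioned query
    values = ", ".join(f"'{value}'" for value in id_list)
    return f"{query} WHERE {column} IN ({values});"


query = add_in_clause_sketch("SELECT PhageID, Cluster FROM phage",
                             column="PhageID", id_list=["Trixie", "L5"])
print(query)
```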

pdm_utils.functions.mysqldb_basic.scalar(engine, executable)

Execute a query and get the first field.

Parameters
  • engine (Engine) – SQLAlchemy Engine object able to connect to a MySQL database.

  • executable (str) – Input an executable MySQL query.

Returns

Scalar result from execution of given MySQL query.

Return type

int

pdm_utils.functions.ncbi module

Misc. functions to interact with NCBI databases.

pdm_utils.functions.ncbi.get_accessions_to_retrieve(summary_records)

Extract accessions from summary records.

Parameters

summary_records (list) – List of dictionaries, where each dictionary is a record summary.

Returns

List of accessions.

Return type

list

pdm_utils.functions.ncbi.get_data_handle(accession_list, db='nucleotide', rettype='gb', retmode='text')
pdm_utils.functions.ncbi.get_records(accession_list, db='nucleotide', rettype='gb', retmode='text')

Retrieve records from NCBI from a list of active accessions.

Uses NCBI efetch implemented through BioPython Entrez.

Parameters
  • accession_list (list) – List of NCBI accessions.

  • db (str) – Name of the database to get summaries from (e.g. ‘nucleotide’).

  • rettype (str) – Type of record to retrieve (e.g. ‘gb’).

  • retmode (str) – Format of data to retrieve (e.g. ‘text’).

Returns

List of BioPython SeqRecords generated from GenBank records.

Return type

list

pdm_utils.functions.ncbi.get_summaries(db='', query_key='', webenv='')

Retrieve record summaries from NCBI for a list of accessions.

Uses NCBI esummary implemented through BioPython Entrez.

Parameters
  • db (str) – Name of the database to get summaries from.

  • query_key (str) – Identifier for the search. This can be directly generated from run_esearch().

  • webenv (str) – Identifier that can be directly generated from run_esearch()

Returns

List of dictionaries, where each dictionary is a record summary.

Return type

list

pdm_utils.functions.ncbi.get_verified_data_handle(acc_id_dict, ncbi_cred_dict={}, batch_size=200, file_type='gb')

Retrieve genomes from GenBank.

output_folder = Path to where files will be saved. acc_id_dict = Dictionary where key = Accession and value = List[PhageIDs]

pdm_utils.functions.ncbi.run_esearch(db='', term='', usehistory='')

Search for valid records in NCBI.

Uses NCBI esearch implemented through BioPython Entrez.

Parameters
  • db (str) – Name of the database to search.

  • term (str) – Search term.

  • usehistory (str) – Indicates if prior searches should be used.

Returns

Results of the search for each valid record.

Return type

dict

pdm_utils.functions.ncbi.set_entrez_credentials(tool=None, email=None, api_key=None)

Set BioPython Entrez credentials to improve speed and reliability.

Parameters
  • tool (str) – Name of the software/tool being used.

  • email (str) – Email contact information for NCBI.

  • api_key (str) – Unique NCBI-issued identifier to enhance retrieval speed.

pdm_utils.functions.parallelize module

Functions to parallelize processing of a list of inputs. Adapted from https://docs.python.org/3/library/multiprocessing.html

pdm_utils.functions.parallelize.count_processors(inputs, num_processors)

Programmatically determines whether the specified num_processors is appropriate. There’s no need to use more processors than there are inputs, and it’s impossible to use fewer than 1 processor or more than exist on the machine running the code.

Parameters
  • inputs – list of inputs

  • num_processors – specified number of processors

Returns

num_processors (optimized)
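
The bounds described for count_processors amount to clamping the request between 1 and the smaller of the input count and the machine’s processor count. A minimal sketch, with the hypothetical name count_processors_sketch and os.cpu_count() assumed as the source of the machine limit:

```python
import os


def count_processors_sketch(inputs, num_processors):
    """Clamp a requested processor count to a usable range."""
    available = os.cpu_count() or 1          # cpu_count() may return None
    # Never more than the number of inputs or the machine's processors.
    upper = max(1, min(len(inputs), available))
    # Never fewer than 1.
    return max(1, min(num_processors, upper))


print(count_processors_sketch(["a", "b", "c"], 1000))
print(count_processors_sketch(["a", "b", "c"], 0))
```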

pdm_utils.functions.parallelize.parallelize(inputs, num_processors, task, verbose=True)

Parallelizes some task on an input list across the specified number of processors.

Parameters
  • inputs – list of inputs

  • num_processors – number of processor cores to use

  • task – name of the function to run

  • verbose – updating progress bar output?

Returns

results

pdm_utils.functions.parallelize.start_processes(inputs, num_processors, verbose)

Creates input and output queues, and runs the jobs.

Parameters
  • inputs – jobs to run

  • num_processors – optimized number of processors

  • verbose – updating progress bar output?

Returns

results

pdm_utils.functions.parallelize.worker(input_queue, output_queue)

pdm_utils.functions.parsing module

pdm_utils.functions.parsing.check_operator(operator, column_object)

Validates an operator’s application on a MySQL column.

Parameters
  • operator (str) – Accepted MySQL operator.

  • column_object (Column) – A SQLAlchemy Column object.

pdm_utils.functions.parsing.create_filter_key(unparsed_filter)

Creates a standardized filter string from a valid unparsed_filter.

Parameters

unparsed_filter – Formatted MySQL WHERE clause.

Returns

Standardized MySQL conditional string.

Return type

str

pdm_utils.functions.parsing.parse_cmd_list(unparsed_string_list)

Recognizes and parses MySQL WHERE clause structures from cmd lists.

Parameters

unparsed_string_list (list[str]) – Formatted MySQL WHERE clause arguments.

Returns

2-D array containing lists of statements joined by ORs.

Return type

list[list]

pdm_utils.functions.parsing.parse_cmd_string(unparsed_cmd_string)

Recognizes and parses MySQL WHERE clause structures.

Parameters

unparsed_cmd_string (str) – Formatted MySQL WHERE clause string.

Returns

2-D array containing lists of statements joined by ORs.

Return type

list[list]
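
One plausible reading of the 2-D return value is that each inner list holds conditions joined by AND, while the inner lists themselves are alternatives joined by OR. The sketch below follows that reading; it is an assumption about the layout, not the parser used by pdm_utils, and the example columns are illustrative.

```python
import re


def split_where_sketch(clause):
    """Split on OR first, then on AND, giving a list of lists of conditions."""
    groups = []
    for or_part in re.split(r"\s+OR\s+", clause.strip()):
        conditions = [c.strip() for c in re.split(r"\s+AND\s+", or_part)]
        groups.append(conditions)
    return groups


parsed = split_where_sketch(
    "phage.Cluster = A AND phage.Subcluster = A2 OR phage.HostGenus = Gordonia")
print(parsed)
```

The sketch only recognizes upper-case AND and OR surrounded by whitespace, and it does not handle quoted values that contain those words.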

pdm_utils.functions.parsing.parse_column(unparsed_column)

Recognizes and parses a MySQL structured column.

Parameters

unparsed_column (str) – Formatted MySQL column.

Returns

List containing segments of a MySQL column.

Return type

list[str]

pdm_utils.functions.parsing.parse_filter(unparsed_filter)

Recognizes and parses a MySQL structured WHERE clause.

Parameters

unparsed_filter – Formatted MySQL WHERE clause.

Returns

List containing segments of a MySQL WHERE clause.

Return type

list[str]

pdm_utils.functions.parsing.parse_in_spaces(unparsed_string_list)

Convert a list of strings to a single space separated string.

Parameters

unparsed_string_list (list[str]) – String list to be concatenated

Returns

Single string in which the input strings are separated by spaces.

Return type

str

pdm_utils.functions.parsing.parse_out_ends(unparsed_string)

Parse and remove beginning and end whitespace of a string.

Parameters

unparsed_string (str) – String with variable terminal whitespaces.

Returns

String with parsed and removed beginning and ending whitespace.

Return type

str

pdm_utils.functions.parsing.parse_out_spaces(unparsed_string)

Parse and remove beginning and internal white space of a string.

Parameters

unparsed_string (str) – String with variable terminal whitespaces.

Returns

String with parsed and removed beginning and ending whitespace.

Return type

str
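
The difference between the three whitespace helpers (parse_in_spaces, parse_out_ends, parse_out_spaces) can be shown with built-in string methods. The function names below are hypothetical counterparts, and the third assumes that “internal” whitespace means every space character.

```python
def join_with_spaces(strings):
    """Counterpart of parse_in_spaces: list of strings to one string."""
    return " ".join(strings)


def strip_ends(text):
    """Counterpart of parse_out_ends: drop leading and trailing whitespace."""
    return text.strip()


def remove_spaces(text):
    """Counterpart of parse_out_spaces, assuming every space is removed."""
    return text.replace(" ", "")


print(join_with_spaces(["phage.Cluster", "=", "A"]))
print(strip_ends("  A2  "))
print(remove_spaces(" phage . Cluster "))
```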

pdm_utils.functions.parsing.translate_column(metadata, raw_column)

Converts a case-insensitive {table}.{column} str to a case-sensitive str.

Parameters
  • metadata (MetaData) – Reflected SQLAlchemy MetaData object.

  • raw_column (str) – Case-insensitive {table}.{column}.

Returns

Case-sensitive column name.

Return type

str
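
The case-insensitive lookup can be sketched with a plain dictionary standing in for the reflected MetaData object. The schema contents, the name translate_column_sketch, and the ValueError raised on a miss are all assumptions for illustration.

```python
def translate_column_sketch(tables, raw_column):
    """Resolve a case-insensitive 'table.column' string to its stored casing."""
    raw_table, raw_name = raw_column.split(".", 1)
    for table, columns in tables.items():
        if table.lower() == raw_table.lower():
            for column in columns:
                if column.lower() == raw_name.lower():
                    return column
    raise ValueError(f"No match for {raw_column!r}")


# Hypothetical schema: a plain dict standing in for reflected MetaData.
schema = {"phage": ["PhageID", "Cluster"], "gene": ["GeneID", "PhamID"]}
print(translate_column_sketch(schema, "PHAGE.phageid"))
```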

pdm_utils.functions.parsing.translate_table(metadata, raw_table)

Converts a case-insensitive table name to a case-sensitive str.

Parameters
  • metadata (MetaData) – Reflected SQLAlchemy MetaData object.

  • raw_table (str) – Case-insensitive table name.

Returns

Case-sensitive table name.

Return type

str

pdm_utils.functions.phagesdb module

Functions to interact with PhagesDB

pdm_utils.functions.phagesdb.construct_phage_url(phage_name)

Create URL to retrieve phage-specific data from PhagesDB.

Parameters

phage_name (str) – Name of the phage of interest.

Returns

URL pertaining to the phage.

Return type

str

pdm_utils.functions.phagesdb.create_cluster_subcluster_sets(url='https://phagesdb.org/api/clusters/')

Create sets of clusters and subclusters currently in PhagesDB.

Parameters

url (str) – A URL from which to retrieve cluster and subcluster data.

Returns

tuple (cluster_set, subcluster_set) WHERE cluster_set(set) is a set of all unique clusters on PhagesDB. subcluster_set(set) is a set of all unique subclusters on PhagesDB.

Return type

tuple

pdm_utils.functions.phagesdb.create_host_genus_set(url='https://phagesdb.org/api/host_genera/')

Create a set of host genera currently in PhagesDB.

Parameters

url (str) – A URL from which to retrieve host genus data.

Returns

All unique host genera listed on PhagesDB.

Return type

set

pdm_utils.functions.phagesdb.get_genome(phage_id, gnm_type='', seq=False)

Get genome data from PhagesDB.

Parameters
  • phage_id (str) – The name of the phage to be retrieved from PhagesDB.

  • gnm_type (str) – Identifier for the type of genome.

  • seq (bool) – Indicates whether the genome sequence should be retrieved.

Returns

A pdm_utils Genome object with the parsed data. If no genome is retrieved, None is returned.

Return type

Genome

pdm_utils.functions.phagesdb.get_phagesdb_data(url)

Retrieve all sequenced genome data from PhagesDB.

Parameters

url (str) – URL to connect to PhagesDB API.

Returns

List of dictionaries, where each dictionary contains data for each phage. If a problem is encountered during retrieval, an empty list is returned.

Return type

list

pdm_utils.functions.phagesdb.get_unphamerated_phage_list(url)

Retrieve list of unphamerated phages from PhagesDB.

Parameters

url (str) – A URL from which to retrieve a list of PhagesDB genomes that are not in the most up-to-date instance of the Actino_Draft MySQL database.

Returns

List of PhageIDs.

Return type

list

pdm_utils.functions.phagesdb.parse_accession(data_dict)

Retrieve Accession from PhagesDB.

Parameters

data_dict (dict) – Dictionary of data retrieved from PhagesDB.

Returns

Accession of the phage.

Return type

str

pdm_utils.functions.phagesdb.parse_cluster(data_dict)

Retrieve Cluster from PhagesDB.

If the phage is clustered, ‘pcluster’ is a dictionary, and one key is the Cluster data (Cluster or ‘Singleton’). If for some reason no Cluster info is added at the time the genome is added to PhagesDB, ‘pcluster’ may automatically be set to NULL, which gets converted to “Unclustered” during retrieval. In the MySQL database NULL means Singleton, and the long form “Unclustered” is invalid due to its character length, so this value is converted to ‘UNK’ (‘Unknown’).

Parameters

data_dict (dict) – Dictionary of data retrieved from PhagesDB.

Returns

Cluster of the phage.

Return type

str
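
The branching described above can be sketched as follows. The ‘pcluster’ key comes from the description; the inner key name (cluster_key, defaulting to "cluster") and the function name are assumptions, as is treating a missing ‘pcluster’ entry like NULL.

```python
def parse_cluster_sketch(data_dict, cluster_key="cluster"):
    """Return the cluster name, or 'UNK' when no cluster data was recorded."""
    pcluster = data_dict.get("pcluster")   # assumption: missing is treated like NULL
    if pcluster is None:
        # No cluster info on record: use the short placeholder rather than
        # the longer 'Unclustered'.
        return "UNK"
    return pcluster[cluster_key]           # a cluster name or 'Singleton'


print(parse_cluster_sketch({"pcluster": {"cluster": "A"}}))
print(parse_cluster_sketch({"pcluster": None}))
```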

pdm_utils.functions.phagesdb.parse_fasta_data(fasta_data)

Parses data returned from a fasta-formatted file.

Parameters

fasta_data (str) – Data from a fasta file.

Returns

tuple (header, sequence) WHERE header(str) is the first line parsed from the parsed file. sequence(str) is the nucleotide sequence parsed from the file.

Return type

tuple
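
A minimal sketch of splitting fasta text into its header and sequence is shown below. Removing the leading ‘>’ and stripping whitespace from each line are assumptions about the cleanup performed; parse_fasta_sketch is a hypothetical name.

```python
def parse_fasta_sketch(fasta_data):
    """Split fasta text into (header, sequence)."""
    lines = fasta_data.strip().splitlines()
    if not lines:
        return "", ""
    # Assumption: the '>' marker is not part of the returned header.
    header = lines[0].lstrip(">").strip()
    # Join the remaining lines, dropping stray whitespace around each one.
    sequence = "".join(line.strip() for line in lines[1:])
    return header, sequence


header, sequence = parse_fasta_sketch(">Trixie\nGGTC\nAACT\n")
print(header, sequence)
```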

pdm_utils.functions.phagesdb.parse_fasta_filename(data_dict)

Retrieve fasta filename from PhagesDB.

Parameters

data_dict (dict) – Dictionary of data retrieved from PhagesDB.

Returns

Name of the fasta file for the phage.

Return type

str

pdm_utils.functions.phagesdb.parse_genome_data(data_dict, gnm_type='', seq=False)

Parses a dictionary of PhagesDB genome data into a pdm_utils Genome object.

Parameters
  • data_dict (dict) – Dictionary of data retrieved from PhagesDB.

  • gnm_type (str) – Identifier for the type of genome.

  • seq (bool) – Indicates whether the genome sequence should be retrieved.

Returns

A pdm_utils Genome object with the parsed data.

Return type

Genome

pdm_utils.functions.phagesdb.parse_genomes_dict(data_dict, gnm_type='', seq=False)

Returns a dictionary of pdm_utils Genome objects

Parameters
  • data_dict (dict) – Dictionary of dictionaries. Key = PhageID. Value = Dictionary of genome data retrieved from PhagesDB.

  • gnm_type (str) – Identifier for the type of genome.

  • seq (bool) – Indicates whether the genome sequence should be retrieved.

Returns

Dictionary of pdm_utils Genome objects. Key = PhageID. Value = Genome object.

Return type

dict

pdm_utils.functions.phagesdb.parse_host_genus(data_dict)

Retrieve host_genus from PhagesDB.

Parameters

data_dict (dict) – Dictionary of data retrieved from PhagesDB.

Returns

Host genus of the phage.

Return type

str

pdm_utils.functions.phagesdb.parse_phage_name(data_dict)

Retrieve Phage Name from PhagesDB.

Parameters

data_dict (dict) – Dictionary of data retrieved from PhagesDB.

Returns

Name of the phage.

Return type

str

pdm_utils.functions.phagesdb.parse_subcluster(data_dict)

Retrieve Subcluster from PhagesDB.

If for some reason no cluster info is added at the time the genome is added to PhagesDB, ‘psubcluster’ may automatically be set to NULL, which gets returned as None. If the phage is a Singleton, ‘psubcluster’ is None. If the phage is clustered but not subclustered, ‘psubcluster’ is None. If the phage is clustered and subclustered, ‘psubcluster’ is a dictionary, and one key is the Subcluster data.

Parameters

data_dict (dict) – Dictionary of data retrieved from PhagesDB.

Returns

Subcluster of the phage.

Return type

str

pdm_utils.functions.phagesdb.retrieve_data_list(url)

Retrieve list of data from PhagesDB.

Parameters

url (str) – A URL from which to retrieve data.

Returns

A list of data retrieved from the URL.

Return type

list

pdm_utils.functions.phagesdb.retrieve_genome_data(phage_url)

Retrieve all data from PhagesDB for a specific phage.

Parameters

phage_url (str) – URL for data pertaining to a specific phage.

Returns

Dictionary of data parsed from the URL.

Return type

dict

pdm_utils.functions.phagesdb.retrieve_url_data(url)

Retrieve fasta file from PhagesDB.

Parameters

url (str) – URL for data to be retrieved.

Returns

Data from the URL.

Return type

str

pdm_utils.functions.phameration module

Functions that are used in the phameration pipeline

pdm_utils.functions.phameration.blastp(index, chunk, tmp, db_path, evalue, query_cov)

Runs ‘blastp’ using the given chunk as the input gene set. The blast output is an adjacency matrix for this chunk.

Parameters
  • index (int) – chunk index being run

  • chunk (tuple of 2-tuples) – the translations to run right now

  • tmp (str) – path where I/O can go on

  • db_path (str) – path to the target blast database

  • evalue (float) – e-value cutoff to report hits

pdm_utils.functions.phameration.chunk_translations(translation_groups, chunksize=500)

Break translation_groups into a dictionary of chunksize-tuples of 2-tuples where each 2-tuple is a translation and its corresponding geneid.

Parameters
  • translation_groups (dict) – translations and their geneids

  • chunksize (int) – how many translations will be in a chunk?

Returns

chunks

Return type

dict
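
The chunking can be sketched as below. Two assumptions are made: each translation is paired with the first geneid of its group, and chunks are keyed by consecutive integer indexes. The name chunk_translations_sketch and the example geneids are hypothetical.

```python
def chunk_translations_sketch(translation_groups, chunksize=500):
    """Split (translation, geneid) pairs into numbered chunks."""
    # Assumption: the first geneid of each group represents its translation.
    pairs = [(translation, geneids[0])
             for translation, geneids in translation_groups.items()]
    return {index: tuple(pairs[start:start + chunksize])
            for index, start in enumerate(range(0, len(pairs), chunksize))}


groups = {"MKT": ["Trixie_1", "L5_1"], "MSA": ["L5_2"], "MLL": ["D29_3"]}
chunks = chunk_translations_sketch(groups, chunksize=2)
print(chunks)
```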

pdm_utils.functions.phameration.create_blastdb(fasta, db_name, db_path)

Runs ‘makeblastdb’ to create a BLAST-searchable database.

Parameters
  • fasta (str) – FASTA-formatted input file

  • db_name (str) – BLAST sequence database

  • db_path (str) – BLAST sequence database path

pdm_utils.functions.phameration.fix_colored_orphams(engine)

Find any single-member phams which are colored as though they are multi-member phams (not #FFFFFF in pham.Color).

Parameters

engine – sqlalchemy Engine allowing access to the database

pdm_utils.functions.phameration.fix_white_phams(engine)

Find any phams with 2+ members which are colored as though they are orphams (#FFFFFF in pham.Color).

Parameters

engine – sqlalchemy Engine allowing access to the database

pdm_utils.functions.phameration.get_geneids_and_translations(engine)

Constructs a dictionary mapping all geneids to their translations.

Parameters

engine – the Engine allowing access to the database

Returns

gs_to_ts

pdm_utils.functions.phameration.get_new_geneids(engine)

Queries the database for those genes that are not yet phamerated.

Parameters

engine – the Engine allowing access to the database

Returns

new_geneids

pdm_utils.functions.phameration.get_pham_colors(engine)

Queries the database for the colors of existing phams.

Parameters

engine – the Engine allowing access to the database

Returns

pham_colors

pdm_utils.functions.phameration.get_pham_geneids(engine)

Queries the database for those genes that are already phamerated.

Parameters

engine – the Engine allowing access to the database

Returns

pham_geneids

pdm_utils.functions.phameration.get_translation_groups(engine)

Constructs a dictionary mapping all unique translations to their groups of geneids that share them.

Parameters

engine – the Engine allowing access to the database

Returns

ts_to_gs
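
The relationship between the geneid-to-translation mapping (gs_to_ts) and the translation-to-geneids mapping (ts_to_gs) is a dictionary inversion. The sketch below shows that inversion on plain dictionaries; it does not query a database, the name `group_by_translation` is hypothetical, and storing each group as a list is an assumption.

```python
def group_by_translation(gs_to_ts):
    """Invert {geneid: translation} into {translation: [geneid, ...]}."""
    ts_to_gs = {}
    for geneid, translation in gs_to_ts.items():
        # Genes with identical translations end up in the same group.
        ts_to_gs.setdefault(translation, []).append(geneid)
    return ts_to_gs
```

Grouping identical translations means each distinct protein sequence only needs to be clustered once.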

pdm_utils.functions.phameration.markov_cluster(adj_mat_file, inflation, tmp_dir)

Run ‘mcl’ on an adjacency matrix to cluster the blastp results.

Parameters
  • adj_mat_file (str) – 3-column file with blastp resultant queries, subjects, and evalues

  • inflation (float) – mcl inflation parameter

  • tmp_dir (str) – file I/O directory

Returns

outfile

Return type

str

pdm_utils.functions.phameration.merge_pre_and_hmm_phams(hmm_phams, pre_phams, consensus_lookup)

Merges the pre-pham sequences (which contain all nr sequences) with the hmm phams (which contain only hmm consensus sequences) into the full hmm-based clustering output. Uses consensus_lookup dictionary to find the pre-pham that each consensus belongs to, and then adds each pre-pham geneid to a full pham based on the hmm phams.

Parameters
  • hmm_phams (dict) – clustered consensus sequences

  • pre_phams (dict) – clustered sequences (used to generate hmms)

  • consensus_lookup (dict) – reverse-mapped pre_phams

Returns

phams

Return type

dict

pdm_utils.functions.phameration.mmseqs_clust(consensus_db, align_db, cluster_db)

Runs ‘mmseqs clust’ to cluster an MMseqs2 consensus database using an MMseqs2 alignment database, with results being saved to an MMseqs2 cluster database.

Parameters
  • consensus_db (str) – MMseqs sequence database

  • align_db (str) – MMseqs2 alignment database

  • cluster_db (str) – MMseqs2 cluster database

pdm_utils.functions.phameration.mmseqs_cluster(sequence_db, cluster_db, args)

Runs ‘mmseqs cluster’ to cluster an MMseqs2 sequence database.

Parameters
  • sequence_db (str) – MMseqs2 sequence database

  • cluster_db (str) – MMseqs2 clustered database

  • args (dict) – parsed command line arguments

pdm_utils.functions.phameration.mmseqs_createdb(fasta, sequence_db)

Runs ‘mmseqs createdb’ to convert a FASTA file into an MMseqs2 sequence database.

Parameters
  • fasta (str) – path to the FASTA file to convert

  • sequence_db (str) – MMseqs2 sequence database

pdm_utils.functions.phameration.mmseqs_createseqfiledb(sequence_db, cluster_db, seqfile_db)

Runs ‘mmseqs createseqfiledb’ to create the intermediate to the FASTA-like parseable output.

Parameters
  • sequence_db (str) – MMseqs2 sequence database

  • cluster_db (str) – MMseqs2 clustered database

  • seqfile_db (str) – MMseqs2 seqfile database

pdm_utils.functions.phameration.mmseqs_profile2consensus(profile_db, consensus_db)

Runs ‘mmseqs profile2consensus’ to extract consensus sequences from an MMseqs2 profile database, and creates an MMseqs2 sequence database from the consensuses.

Parameters
  • profile_db (str) – MMseqs2 profile database

  • consensus_db (str) – MMseqs2 sequence database

pdm_utils.functions.phameration.mmseqs_result2flat(query_db, target_db, seqfile_db, outfile)

Runs ‘mmseqs result2flat’ to create FASTA-like parseable output.

Parameters
  • query_db (str) – MMseqs2 sequence or profile database

  • target_db (str) – MMseqs2 sequence database

  • seqfile_db (str) – MMseqs2 seqfile database

  • outfile (str) – FASTA-like parseable output

pdm_utils.functions.phameration.mmseqs_result2profile(sequence_db, cluster_db, profile_db)

Runs ‘mmseqs result2profile’ to convert clusters from one MMseqs2 clustered database into a profile database.

Parameters
  • sequence_db (str) – MMseqs2 sequence database

  • cluster_db (str) – MMseqs2 clustered database

  • profile_db (str) – MMseqs2 profile database

pdm_utils.functions.phameration.mmseqs_search(profile_db, consensus_db, align_db, args)

Runs ‘mmseqs search’ to search profiles against their consensus sequences and save the alignment results to an MMseqs2 alignment database. The profile_db and consensus_db MUST be the same size.

Parameters
  • profile_db (str) – MMseqs2 profile database

  • consensus_db (str) – MMseqs2 sequence database

  • align_db (str) – MMseqs2 alignment database

  • args (dict) – parsed command line arguments

pdm_utils.functions.phameration.parse_mcl_output(outfile)

Parse the mci output into phams.

Parameters

outfile (str) – mci output file

Returns

phams

Return type

dict
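
As a rough illustration of turning clustering output into a pham dictionary, the sketch below assumes the file holds one cluster per line with tab-separated member labels, which is the layout mcl writes for label (‘--abc’) input. The function name `parse_cluster_lines`, the use of sets, and numbering phams from 1 are all assumptions made for this example.

```python
def parse_cluster_lines(outfile):
    """Read one tab-separated cluster per line into {pham number: members}."""
    phams = {}
    with open(outfile, "r") as handle:
        for line in handle:
            members = line.rstrip("\n").split("\t")
            if members == [""]:
                # Skip blank lines rather than creating empty phams.
                continue
            # Assumption: phams are numbered consecutively from 1.
            phams[len(phams) + 1] = set(members)
    return phams
```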

pdm_utils.functions.phameration.parse_mmseqs_output(outfile)

Parses the indicated MMseqs2 FASTA-like file into a dictionary of integer-named phams.

Parameters

outfile (str) – FASTA-like parseable output

Returns

phams

Return type

dict

pdm_utils.functions.phameration.preserve_phams(old_phams, new_phams, old_colors, new_genes)

Attempts to keep pham numbers consistent from one round of pham building to the next.

Parameters
  • old_phams – the dictionary that maps old phams to their genes

  • new_phams – the dictionary that maps new phams to their genes

  • old_colors – the dictionary that maps old phams to colors

  • new_genes – the set of previously unphamerated genes

pdm_utils.functions.phameration.reintroduce_duplicates(new_phams, trans_groups, genes_and_trans)

Reintroduces into each pham ALL GeneIDs that map onto the set of translations in the pham.

Parameters
  • new_phams – the pham dictionary for which duplicates are to be reintroduced

  • trans_groups – the dictionary that maps translations to the GeneIDs that share them

  • genes_and_trans – the dictionary that maps GeneIDs to their translations
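
The reintroduction step can be sketched with the three dictionaries named above. This is an illustration only: the function name `reintroduce_duplicates_sketch` is hypothetical, and returning a new dictionary of sets (instead of modifying new_phams in place) is an assumption.

```python
def reintroduce_duplicates_sketch(new_phams, trans_groups, genes_and_trans):
    """Expand each pham to every GeneID sharing one of its translations."""
    full_phams = {}
    for pham, geneids in new_phams.items():
        members = set()
        for geneid in geneids:
            # Look up the gene's translation, then pull in every GeneID
            # that shares that translation.
            translation = genes_and_trans[geneid]
            members.update(trans_groups[translation])
        full_phams[pham] = members
    return full_phams
```

Because clustering runs on non-redundant sequences, a step like this is what restores genes whose translations were collapsed into a single representative.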

pdm_utils.functions.phameration.update_gene_table(phams, engine)

Updates the gene table with new pham data.

Parameters
  • phams (dict) – new pham gene data

  • engine – sqlalchemy Engine allowing access to the database

pdm_utils.functions.phameration.update_pham_table(colors, engine)

Populates the pham table with the new PhamIDs and their colors.

Parameters
  • colors (dict) – new pham color data

  • engine – sqlalchemy Engine allowing access to the database

pdm_utils.functions.phameration.write_fasta(translation_groups, outfile)

Writes a FASTA file of the non-redundant protein sequences to be assorted into phamilies.

Parameters
  • translation_groups (dict) – groups of genes that share a translation

  • outfile (str) – FASTA filename
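
A minimal sketch of writing such a non-redundant FASTA file follows. It assumes translation_groups maps each translation to a list of geneids and that the first geneid serves as the record header; neither detail is confirmed by the description above, and the name `write_fasta_sketch` is hypothetical.

```python
def write_fasta_sketch(translation_groups, outfile):
    """Write one FASTA record per unique translation."""
    with open(outfile, "w") as handle:
        for translation, geneids in translation_groups.items():
            # Assumption: the first geneid in the group labels the record.
            handle.write(f">{geneids[0]}\n{translation}\n")
```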

pdm_utils.functions.pipelines_basic module

pdm_utils.functions.pipelines_basic.add_sort_columns(db_filter, sort_columns, verbose=False)
pdm_utils.functions.pipelines_basic.build_alchemist(database, ask_database=True, config=None, dialect='mysql')
pdm_utils.functions.pipelines_basic.build_filter(alchemist, key, filters, values=None, verbose=False)

Applies MySQL WHERE clause filters using a Filter.

Parameters
  • alchemist (AlchemyHandler) – A connected and fully built AlchemyHandler object.

  • table (str) – MySQL table name.

  • filters (list[list[str]]) – A list of lists with filter values, grouped by ORs.

  • groups (list[str]) – A list of supported MySQL column names.

Returns

Filter object loaded with filters.

Return type

Filter

pdm_utils.functions.pipelines_basic.build_groups_map(db_filter, export_path, groups=[], verbose=False, force=False, dump=False)

Function that generates a map between conditionals and grouping paths.

Parameters
  • db_filter (Filter) – A connected and fully loaded Filter object.

  • export_path – Path to a dir for new dir creation.

  • groups (list[str]) – A list of supported MySQL column names.

  • conditionals_map (dict{Path:list}) – A mapping between group conditionals and Paths.

  • verbose (bool) – A boolean value to toggle progress print statements.

  • previous (str) – Value set by function to provide info for print statements

  • depth (int) – Value set by function to provide info for print statements.

Returns

A mapping between group conditionals and Paths.

Return type

dict{Path:list}

pdm_utils.functions.pipelines_basic.build_groups_tree(db_filter, export_path, conditionals_map, groups=[], verbose=False, force=False, previous=None, depth=0)

Recursive function that generates directories based on groupings.

Parameters
  • db_filter (Filter) – A connected and fully loaded Filter object.

  • export_path – Path to a dir for new dir creation.

  • groups (list[str]) – A list of supported MySQL column names.

  • conditionals_map (dict{Path:list}) – A mapping between group conditionals and Paths.

  • verbose (bool) – A boolean value to toggle progress print statements.

  • previous (str) – Value set by function to provide info for print statements

  • depth (int) – Value set by function to provide info for print statements.

Returns

A mapping between group conditionals and Paths.

Return type

dict{Path:list}

pdm_utils.functions.pipelines_basic.convert_dir_path(path)

Function to convert argparse input to a working directory path.

Parameters

path (str) – A string to be converted into a Path object.

Returns

A Path object converted from the input string.

Return type

Path

pdm_utils.functions.pipelines_basic.convert_file_path(path)

Function to convert argparse input to a working file path.

Parameters

path (str) – A string to be converted into a Path object.

Returns

A Path object converted from the input string.

Return type

Path

pdm_utils.functions.pipelines_basic.create_default_path(name, force=False, attempt=50)
pdm_utils.functions.pipelines_basic.create_working_dir(working_path, dump=False, force=False)
pdm_utils.functions.pipelines_basic.create_working_path(folder_path, folder_name, dump=False, force=False, attempt=50)
pdm_utils.functions.pipelines_basic.parse_value_input(value_list_input)
pdm_utils.functions.pipelines_basic.parse_value_input(value_list_input: pathlib.Path)
pdm_utils.functions.pipelines_basic.parse_value_input(value_list_input: list)

Function to convert input values to a recognized data type.

Parameters

value_list_input (Path) – Values stored in recognized data types.

Returns

List of values to filter database results.

Return type

list[str]
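
The three signatures listed for parse_value_input suggest dispatch on the argument type. One way to get that behavior in Python is functools.singledispatch, sketched below. Whether the package uses this mechanism is not stated here, and the details are assumptions: the name `parse_values`, the one-value-per-line file format, and raising TypeError for unsupported input.

```python
from functools import singledispatch
from pathlib import Path


@singledispatch
def parse_values(value_list_input):
    """Fallback for unsupported input types (assumed behavior)."""
    raise TypeError(f"Unsupported input type: {type(value_list_input).__name__}")


@parse_values.register
def _(value_list_input: Path):
    # Assumption: the file holds one value per line; blank lines are skipped.
    lines = value_list_input.read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


@parse_values.register
def _(value_list_input: list):
    # Lists are passed through with every item converted to a string.
    return [str(value) for value in value_list_input]
```

With this layout, callers can pass either a list of values or a Path to a file of values and receive a list[str] in both cases.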

pdm_utils.functions.querying module

pdm_utils.functions.querying.append_group_by_clauses(executable, group_by_clauses)

Add GROUP BY SQLAlchemy Column objects to a Select object.

Parameters
  • executable (Select) – SQLAlchemy executable query object.

  • group_by_clauses (list) – MySQL GROUP BY clause-related SQLAlchemy object(s).

Returns

MySQL expression-related SQLAlchemy executable.

Return type

Select

pdm_utils.functions.querying.append_having_clauses(executable, having_clauses)

Add HAVING SQLAlchemy Column objects to a Select object.

Parameters
  • executable (Select) – SQLAlchemy executable query object.

  • having_clauses – MySQL HAVING clause-related SQLAlchemy object(s).

Returns

MySQL expression-related SQLAlchemy executable.

Return type

Select

pdm_utils.functions.querying.append_order_by_clauses(executable, order_by_clauses)

Add ORDER BY SQLAlchemy Column objects to a Select object.

Parameters
  • executable (Select) – SQLAlchemy executable query object.

  • order_by_clauses (list) – MySQL ORDER BY clause-related SQLAlchemy object(s)

Returns

MySQL expression-related SQLAlchemy executable.

Return type

Select

pdm_utils.functions.querying.append_where_clauses(executable, where_clauses)

Add WHERE SQLAlchemy BinaryExpression objects to a Select object.

Parameters
  • executable (Select) – SQLAlchemy executable query object.

  • where_clauses (list) – MySQL WHERE clause-related SQLAlchemy object(s).

Returns

MySQL expression-related SQLAlchemy executable.

Return type

Select

pdm_utils.functions.querying.build_count(db_graph, columns, where=None, add_in=None)

Get MySQL COUNT() expression SQLAlchemy executable.

Parameters
  • db_graph (Graph) – SQLAlchemy structured NetworkX Graph object.

  • columns (list) – SQLAlchemy Column object(s).

  • where (list) – MySQL WHERE clause-related SQLAlchemy object(s).

  • add_in (list) – MySQL Column-related inputs to be considered for joining.

Returns

MySQL COUNT() expression-related SQLAlchemy executable.

Return type

Select

pdm_utils.functions.querying.build_distinct(db_graph, columns, where=None, order_by=None, add_in=None)

Get MySQL DISTINCT expression SQLAlchemy executable.

Parameters
  • db_graph (Graph) – SQLAlchemy structured NetworkX Graph object.

  • columns (list) – SQLAlchemy Column object(s).

  • where (list) – MySQL WHERE clause-related SQLAlchemy object(s).

  • order_by (list) – MySQL ORDER BY clause-related SQLAlchemy object(s).

  • add_in (list) – MySQL Column-related inputs to be considered for joining.

Returns

MySQL DISTINCT expression-related SQLAlchemy executable.

Return type

Select

pdm_utils.functions.querying.build_fromclause(db_graph, columns)

Get a joined table from pathing instructions for joining MySQL Tables.

Parameters
  • db_graph (Graph) – SQLAlchemy structured NetworkX Graph object.

  • columns (Column or list) – SQLAlchemy Column object(s).

Returns

SQLAlchemy Table object containing left outer-joined tables.

Return type

Table

pdm_utils.functions.querying.build_graph(metadata)

Get a NetworkX Graph object populated from a SQLAlchemy MetaData object.

Parameters

metadata (MetaData) – Reflected SQLAlchemy MetaData object.

Returns

Populated and structured NetworkX Graph object.

Return type

Graph

pdm_utils.functions.querying.build_onclause(db_graph, source_table, adjacent_table)
Creates a SQLAlchemy BinaryExpression object for a MySQL ON clause expression.

Parameters
  • db_graph (Graph) – SQLAlchemy structured NetworkX Graph object.

  • source_table (str) – Case-insensitive MySQL table name.

  • adjacent_table – Case-insensitive MySQL table name.

Returns

MySQL foreign key related SQLAlchemy BinaryExpression object.

Return type

BinaryExpression

pdm_utils.functions.querying.build_select(db_graph, columns, where=None, order_by=None, add_in=None, having=None, group_by=None)

Get MySQL SELECT expression SQLAlchemy executable.

Parameters
  • db_graph (Graph) – SQLAlchemy structured NetworkX Graph object.

  • columns (list) – SQLAlchemy Column object(s).

  • where (list) – MySQL WHERE clause-related SQLAlchemy object(s).

  • order_by (list) – MySQL ORDER BY clause-related SQLAlchemy object(s).

  • add_in (list) – MySQL Column-related inputs to be considered for joining.

  • having (list) – MySQL HAVING clause-related SQLAlchemy object(s).

  • group_by (list) – MySQL GROUP BY clause-related SQLAlchemy object(s).

Returns

MySQL SELECT expression-related SQLAlchemy executable.

Return type

Select

pdm_utils.functions.querying.build_where_clause(db_graph, filter_expression)
Creates a SQLAlchemy BinaryExpression object from a MySQL WHERE clause expression.

Parameters
  • db_graph (Graph) – SQLAlchemy structured NetworkX Graph object.

  • filter_expression (str) – MySQL where clause expression.

Returns

MySQL expression-related SQLAlchemy BinaryExpression object.

Return type

BinaryExpression

pdm_utils.functions.querying.execute(engine, executable, in_column=None, values=[], limit=8000, return_dict=True)

Use SQLAlchemy Engine to execute a MySQL query.

Parameters
  • engine (Engine) – SQLAlchemy Engine object used for executing queries.

  • executable (str) – An executable MySQL query.

  • return_dict (Boolean) – Toggle whether execute returns dict or tuple.

Returns

Results from execution of given MySQL query.

Return type

list[dict] or list[tuple]

pdm_utils.functions.querying.execute_value_subqueries(engine, executable, in_column, source_values, return_dict=True, limit=8000)

Query with a conditional on a set of values using subqueries.

Parameters
  • engine (Engine) – SQLAlchemy Engine object used for executing queries.

  • executable (str) – An executable MySQL query.

  • in_column (Column) – SQLAlchemy Column object.

  • source_values (list[str]) – Values from specified MySQL column.

  • return_dict (Boolean) – Toggle whether to return data as a dictionary.

  • limit (int) – SQLAlchemy IN clause query length limiter.

Returns

List of grouped data for each value constraint.

Return type

list
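
One reading of the limit parameter is that it caps how many values go into a single IN clause, with longer value lists split across several queries. The sketch below illustrates that batching idea using the standard-library sqlite3 module in place of SQLAlchemy and MySQL. The function name `select_in_batches`, the `phage` table, and its columns are invented for the example, and the smaller default limit is chosen to suit SQLite.

```python
import sqlite3


def select_in_batches(conn, values, limit=500):
    """Run one IN-clause query per batch of at most 'limit' values."""
    rows = []
    for start in range(0, len(values), limit):
        batch = values[start:start + limit]
        # One placeholder per value keeps the query parameterized.
        placeholders = ", ".join("?" for _ in batch)
        query = (
            "SELECT PhageID, Cluster FROM phage "
            f"WHERE PhageID IN ({placeholders})"
        )
        rows.extend(conn.execute(query, batch).fetchall())
    return rows
```

Splitting the values keeps each statement below whatever length or parameter-count ceiling the database driver imposes, at the cost of several round trips.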

pdm_utils.functions.querying.extract_column(column, check=None)

Get a column from a supported SQLAlchemy Column-related object.

Parameters
  • column (UnaryExpression) – SQLAlchemy Column-related object.

  • check (<type BinaryExpression>) – SQLAlchemy Column-related object type.

Returns

Corresponding SQLAlchemy Column object.

Return type

Column

pdm_utils.functions.querying.extract_columns(columns, check=None)

Get columns from supported SQLAlchemy Column-related object(s).

Parameters
  • columns (list) – SQLAlchemy Column-related object(s).

  • check (<type BinaryExpression>) – SQLAlchemy Column-related object type.

Returns

List of SQLAlchemy Column objects.

Return type

list[Column]

pdm_utils.functions.querying.first_column(engine, executable, in_column=None, values=[], limit=8000)

Use SQLAlchemy Engine to execute and return the first column of fields.

Parameters
  • engine (Engine) – SQLAlchemy Engine object used for executing queries.

  • executable (str) – Input an executable MySQL query.

Returns

A column for a set of MySQL values.

Return type

list[str]

pdm_utils.functions.querying.first_column_value_subqueries(engine, executable, in_column, source_values, limit=8000)

Query with a conditional on a set of values using subqueries.

Parameters
  • engine (Engine) – SQLAlchemy Engine object used for executing queries.

  • executable (str) – An executable MySQL query.

  • in_column (Column) – SQLAlchemy Column object.

  • source_values (list[str]) – Values from specified MySQL column.

  • return_dict (Boolean) – Toggle whether to return data as a dictionary.

  • limit (int) – SQLAlchemy IN clause query length limiter.

Returns

Distinct values fetched from value constraints.

Return type

list

pdm_utils.functions.querying.get_column(metadata, column)

Get a SQLAlchemy Column object, with a case-insensitive input. Input must be formatted {Table_name}.{Column_name}.

Parameters
  • metadata (MetaData) – Reflected SQLAlchemy MetaData object.

  • column (str) – Case-insensitive column name, formatted as {Table_name}.{Column_name}.

Returns

Corresponding SQLAlchemy Column object.

Return type

Column
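
The case-insensitive {Table_name}.{Column_name} lookup can be illustrated without SQLAlchemy. In the sketch below a plain dictionary of table names to column names stands in for the reflected MetaData object; the function name `find_column`, the returned string form, and raising ValueError on a miss are assumptions made for this example.

```python
def find_column(tables, column):
    """Resolve 'table.column' case-insensitively to its stored spelling."""
    table_name, column_name = column.split(".", 1)
    for stored_table, stored_columns in tables.items():
        if stored_table.lower() != table_name.lower():
            continue
        for stored_column in stored_columns:
            if stored_column.lower() == column_name.lower():
                return f"{stored_table}.{stored_column}"
    # Assumed behavior: a failed lookup is reported as an error.
    raise ValueError(f"No column matches '{column}'.")
```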

pdm_utils.functions.querying.get_table(metadata, table)

Get a SQLAlchemy Table object, with a case-insensitive input.

Parameters
  • metadata (MetaData) – Reflected SQLAlchemy MetaData object.

  • table (str) – Case-insensitive table name.

Returns

Corresponding SQLAlchemy Table object.

Return type

Table

pdm_utils.functions.querying.get_table_list(columns)

Get a nonrepeating list of SQLAlchemy Table objects from Column objects.

Parameters

columns (list) – SQLAlchemy Column object(s).

Returns

List of corresponding SQLAlchemy Table objects.

Return type

list

pdm_utils.functions.querying.get_table_pathing(db_graph, table_list, center_table=None)

Get pathing instructions for joining MySQL Table objects.

Parameters
  • db_graph (Graph) – SQLAlchemy structured NetworkX Graph object.

  • table_list (list[Table]) – List of SQLAlchemy Table objects.

  • center_table (Table) – SQLAlchemy Table object to begin traversals from.

Returns

2-D list containing the center table and pathing instructions.

Return type

list
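
To illustrate what pathing instructions might look like, the sketch below runs a breadth-first search over a small adjacency dictionary that stands in for the NetworkX graph. It is not the package's algorithm: the names `table_pathing` and `shortest_path`, the default of starting from the first listed table, the exact shape of the returned list, and the example table names are all assumptions.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search returning the first (shortest) path found."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in graph.get(path[-1], ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


def table_pathing(graph, table_list, center_table=None):
    """Return [center, [path, ...]] with one path per other reachable table."""
    # Assumption: without an explicit center, start from the first table.
    center = center_table if center_table is not None else table_list[0]
    paths = []
    for target in table_list:
        if target == center:
            continue
        path = shortest_path(graph, center, target)
        if path is not None:
            paths.append(path)
    return [center, paths]
```

Each path lists the tables to traverse in order, so a join builder can walk it and attach one table at a time.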

pdm_utils.functions.querying.join_pathed_tables(db_graph, table_pathing)

Get a joined table from pathing instructions for joining MySQL Tables.

Parameters
  • db_graph (Graph) – SQLAlchemy structured NetworkX Graph object.

  • table_pathing (list) – 2-D list containing a Table and pathing lists.

Returns

SQLAlchemy Table object containing left outer-joined tables.

Return type

Table

pdm_utils.functions.querying.query(session, db_graph, table_map, where=None)

Use SQLAlchemy session to retrieve ORM objects from a mapped object.

Parameters
  • session (Session) – Bound and connected SQLAlchemy Session object.

  • table_map – SQLAlchemy ORM map object.

  • where (list) – MySQL WHERE clause-related SQLAlchemy object(s).

  • order_by (list) – MySQL ORDER BY clause-related SQLAlchemy object(s).

Returns

List of mapped object instances.

Return type

list

pdm_utils.functions.server module

Misc. functions to utilize a server.

pdm_utils.functions.server.get_transport(host)

Create paramiko Transport with the server name.

Parameters

host (str) – Server to connect to.

Returns

Paramiko Transport object. If the server is not available, None is returned.

Return type

Transport

pdm_utils.functions.server.set_log_file(filepath)

Set the filepath used to store the Paramiko output.

This is a soft requirement for compliance with Paramiko standards. If it is not set, paramiko throws an error.

Parameters

filepath (Path) – Path to file to log Paramiko results.

pdm_utils.functions.server.setup_sftp_conn(transport, user=None, pwd=None, attempts=1)

Get credentials and setup connection to the server.

Parameters
  • transport (Transport) – Paramiko Transport object directed towards a valid server.

  • attempts (int) – Number of attempts to connect to the server.

Returns

Paramiko SFTPClient connection. If no connection can be made, None is returned.

Return type

SFTPClient

pdm_utils.functions.server.upload_file(sftp, local_filepath, remote_filepath)

Upload a file to the server.

Parameters
  • sftp (SFTPClient) – Paramiko SFTPClient connection to a server.

  • local_filepath (str) – Absolute path to file to be uploaded.

  • remote_filepath (str) – Absolute path to server destination.

Returns

Indicates whether upload was successful.

Return type

bool

pdm_utils.functions.tickets module

Misc. functions to manipulate tickets.

pdm_utils.functions.tickets.construct_tickets(list_of_data_dict, eval_data_dict, description_field, required_keys, optional_keys, keywords)

Construct pdm_utils ImportTickets from parsed data dictionaries.

Parameters
  • list_of_data_dict (list) – List of import ticket data dictionaries.

  • eval_data_dict (dict) – Dictionary of boolean evaluation flags.

  • description_field (str) – Default value to set ticket.description_field attribute if not present in the data dictionary.

  • required_keys (set) – Set of keys required to be in the data dictionary.

  • optional_keys (set) – Set of optional keys that are not required to be in the data dictionary.

  • keywords (set) – Set of valid keyword values that are handled differently than other values.

Returns

List of pdm_utils ImportTicket objects.

Return type

list

pdm_utils.functions.tickets.get_genome(tkt, gnm_type='')

Construct a pdm_utils Genome object from a pdm_utils ImportTicket object.

Parameters
  • tkt (ImportTicket) – A pdm_utils ImportTicket object.

  • gnm_type (str) – Identifier for the type of genome.

Returns

A pdm_utils Genome object.

Return type

Genome

pdm_utils.functions.tickets.identify_duplicates(list_of_tickets, null_set={})

Compare all tickets to each other to identify ticket conflicts.

Identifies if the same id, PhageID, and Accession is present in multiple tickets.

Parameters
  • list_of_tickets (list) – A list of pdm_utils ImportTicket objects.

  • null_set (set) – A set of values that may be expected to be duplicated, that should not throw errors.

Returns

tuple (tkt_id_dupes, phage_id_dupes), where tkt_id_dupes (set) is a set of duplicate ticket ids and phage_id_dupes (set) is a set of duplicate PhageIDs.

Return type

tuple
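
Duplicate detection of this kind can be sketched with collections.Counter. The example below uses a small namedtuple as a stand-in for ImportTicket; the attribute names `id` and `phage_id`, the function name `find_duplicates`, and the decision to track only the two sets named in the return value are assumptions.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for a pdm_utils ImportTicket.
Ticket = namedtuple("Ticket", ["id", "phage_id"])


def find_duplicates(list_of_tickets, null_set=frozenset()):
    """Return (tkt_id_dupes, phage_id_dupes) as two sets."""
    # Values in null_set are expected to repeat, so they are not counted.
    id_counts = Counter(
        tkt.id for tkt in list_of_tickets if tkt.id not in null_set)
    phage_counts = Counter(
        tkt.phage_id for tkt in list_of_tickets if tkt.phage_id not in null_set)
    tkt_id_dupes = {value for value, count in id_counts.items() if count > 1}
    phage_id_dupes = {value for value, count in phage_counts.items() if count > 1}
    return tkt_id_dupes, phage_id_dupes
```

Counting once per field keeps the comparison linear in the number of tickets, rather than comparing every ticket against every other.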

pdm_utils.functions.tickets.modify_import_data(data_dict, required_keys, optional_keys, keywords)

Modifies ticket data to conform to requirements for an ImportTicket object.

Parameters
  • data_dict (dict) – Dictionary of import ticket data.

  • required_keys (set) – Set of keys required to be in the data dictionary.

  • optional_keys (set) – Set of optional keys that are not required to be in the data dictionary.

  • keywords (set) – Set of valid keyword values that are handled differently than other values.

Returns

Indicates if the ticket is structured properly.

Return type

bool

pdm_utils.functions.tickets.parse_import_ticket_data(data_dict)

Converts import ticket data to an ImportTicket object.

Parameters

data_dict (dict) –

A dictionary of data with the following keys:

  1. Import action type

  2. Primary PhageID

  3. Host

  4. Cluster

  5. Subcluster

  6. Status

  7. Annotation Author (int)

  8. Feature field

  9. Accession

  10. Retrieve Record (int)

  11. Evaluation mode

Returns

A pdm_utils ImportTicket object.

Return type

ImportTicket

pdm_utils.functions.tickets.set_dict_value(data_dict, key, first, second)

Set the value for a specific key based on ‘type’ key-value.

Parameters
  • data_dict (dict) – Dictionary of import ticket data.

  • key (str) – Dictionary key to change value of.

  • first (str) – Value to assign to ‘key’ if ‘type’ == ‘add’.

  • second (str) – Value to assign to ‘key’ if ‘type’ != ‘add’.
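
The rule described by these parameters fits in one conditional assignment. The sketch below follows the description directly; the function name `set_value_by_type` is hypothetical, as are the key and values used to exercise it.

```python
def set_value_by_type(data_dict, key, first, second):
    """Assign 'first' when the ticket type is 'add', otherwise 'second'."""
    # The dictionary is modified in place; nothing is returned.
    data_dict[key] = first if data_dict["type"] == "add" else second
```

For example, calling `set_value_by_type(ticket, "cluster", "retrieve", "retain")` on a dictionary whose ‘type’ is ‘add’ stores ‘retrieve’ under ‘cluster’; any other type stores ‘retain’.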

pdm_utils.functions.tickets.set_empty(data_dict)

Convert None values to an empty string.

Parameters

data_dict (dict) – Dictionary of import ticket data.

pdm_utils.functions.tickets.set_keywords(data_dict, keywords)

Convert specific values in a dictionary to lowercase.

Parameters
  • data_dict (dict) – Dictionary of import ticket data.

  • keywords (set) – Set of valid keyword values that are handled differently than other values.
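
A minimal sketch of the keyword normalization follows. It assumes a value counts as a keyword when its lowercase form appears in the keywords set and that all other values are left untouched; the function name `lowercase_keywords` and the example keys and keywords are illustrative only.

```python
def lowercase_keywords(data_dict, keywords):
    """Lowercase values that match a keyword; leave other values as-is."""
    for key, value in data_dict.items():
        # Assumption: matching ignores case, and only strings are compared.
        if isinstance(value, str) and value.lower() in keywords:
            data_dict[key] = value.lower()
```

Only existing keys are reassigned, so the dictionary can be updated safely while iterating over it.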

pdm_utils.functions.tickets.set_missing_keys(data_dict, expected_keys)

Add key-value pairs to a dictionary for any expected keys that it doesn’t already have.

Parameters
  • data_dict (dict) – Dictionary of import ticket data.

  • expected_keys (set) – Set of keys expected to be in the dictionary.

Module contents