basic

Miscellaneous basic functions. To prevent circular imports, these should not require importing other modules in this package.

pdm_utils.functions.basic.ask_yes_no(prompt='', response_attempt=1)

Function to get the user’s yes/no response to a question.

Accepts variations of yes/y, true/t, no/n, false/f, exit/quit/q.

Parameters
  • prompt (str) – the question to ask the user.

  • response_attempt (int) – The number of attempts allowed before the function exits. This prevents the script from getting stuck in a loop.

Returns

The default is False (e.g. user hits Enter without typing anything else), but variations of yes or true responses will return True instead. If the response is ‘exit’ or ‘quit’, the loop is exited and None is returned.

Return type

bool, None

pdm_utils.functions.basic.check_empty(value, lower=True)

Checks if the value represents a null value.

Parameters
  • value (misc.) – Value to be checked against the empty set.

  • lower (bool) – Indicates whether the input value should be lowercased prior to checking.

Returns

Indicates whether the value is present in the empty set.

Return type

bool

pdm_utils.functions.basic.check_value_expected_in_set(value, set1, expect=True)

Check if a value is present within a set and if it is expected.

Parameters
  • value (misc.) – The value to be checked.

  • set1 (set) – The reference set of values.

  • expect (bool) – Indicates if ‘value’ is expected to be present in ‘set1’.

Returns

The result of the evaluation.

Return type

bool

pdm_utils.functions.basic.check_value_in_two_sets(value, set1, set2)

Check if a value is present within two sets.

Parameters
  • value (misc.) – The value to be checked.

  • set1 (set) – The first reference set of values.

  • set2 (set) – The second reference set of values.

Returns

The result of the evaluation, indicating whether the value is present within:

  1. only the ‘first’ set

  2. only the ‘second’ set

  3. ‘both’ sets

  4. ‘neither’ set

Return type

str
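
The four possible results can be illustrated with a minimal standalone sketch. It only mirrors the behavior described above and is not the package's implementation; `sketch_check_value_in_two_sets` is a hypothetical name.

```python
def sketch_check_value_in_two_sets(value, set1, set2):
    """Report which of two sets contain the value (hypothetical helper)."""
    in_first = value in set1
    in_second = value in set2
    if in_first and in_second:
        return "both"
    if in_first:
        return "first"
    if in_second:
        return "second"
    return "neither"


print(sketch_check_value_in_two_sets("A", {"A", "B"}, {"B", "C"}))  # first
print(sketch_check_value_in_two_sets("B", {"A", "B"}, {"B", "C"}))  # both
```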

pdm_utils.functions.basic.choose_from_list(options)

Iterate through a list of values and choose a value.

Parameters

options (list) – List of options to choose from.

Returns

The user-selected option, or None.

Return type

option or None

pdm_utils.functions.basic.choose_most_common(string, values)

Identify most common occurrence of several values in a string.

Parameters
  • string (str) – String to search.

  • values (list) – List of string characters. The order in the list indicates preference, in the case of a tie.

Returns

Value from values that occurs most.

Return type

str
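
The tie-breaking rule can be illustrated with a short standalone sketch. It is not the package's implementation; `sketch_choose_most_common` is a hypothetical name, and the sketch assumes that values earlier in the list are preferred and that `values` is not empty.

```python
def sketch_choose_most_common(string, values):
    """Return the value that occurs most often in string (hypothetical helper)."""
    # max() returns the first maximal element, so earlier values win ties.
    # Treating "earlier" as "preferred" is an assumption about the docs above.
    return max(values, key=string.count)


print(sketch_choose_most_common("AABBC", ["B", "A", "C"]))  # B
```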

pdm_utils.functions.basic.clear_screen()

Brings the command line to the top of the screen.

pdm_utils.functions.basic.compare_cluster_subcluster(cluster, subcluster)

Check if a cluster and subcluster designation are compatible.

Parameters
  • cluster (str) – The cluster value to be compared. ‘Singleton’ and ‘UNK’ are lowercased.

  • subcluster (str) – The subcluster value to be compared.

Returns

The result of the evaluation, indicating whether the two values are compatible.

Return type

bool

pdm_utils.functions.basic.compare_sets(set1, set2)

Compute the intersection and differences between two sets.

Parameters
  • set1 (set) – The first input set.

  • set2 (set) – The second input set.

Returns

tuple (set_intersection, set1_diff, set2_diff) WHERE set_intersection(set) is the set of shared values. set1_diff(set) is the set of values unique to the first set. set2_diff(set) is the set of values unique to the second set.

Return type

tuple
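
The returned tuple corresponds to standard set operations. The following is a minimal sketch under that reading, not the package's code; `sketch_compare_sets` is a hypothetical name.

```python
def sketch_compare_sets(set1, set2):
    """Return (intersection, unique to set1, unique to set2)."""
    set_intersection = set1 & set2
    set1_diff = set1 - set2
    set2_diff = set2 - set1
    return set_intersection, set1_diff, set2_diff


shared, only_first, only_second = sketch_compare_sets({1, 2, 3}, {3, 4})
print(sorted(shared), sorted(only_first), sorted(only_second))  # [3] [1, 2] [4]
```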

pdm_utils.functions.basic.convert_empty(input_value, format, upper=False)

Converts common null value formats.

Parameters
  • input_value (str, int, datetime) – Value to be re-formatted.

  • format (str) – Indicates how the value should be edited. Valid format types include:

      ‘empty_string’ = ‘’
      ‘none_string’ = ‘none’
      ‘null_string’ = ‘null’
      ‘none_object’ = None
      ‘na_long’ = ‘not applicable’
      ‘na_short’ = ‘na’
      ‘n/a’ = ‘n/a’
      ‘zero_string’ = ‘0’
      ‘zero_num’ = 0
      ‘empty_datetime_obj’ = datetime object with an arbitrary date, ‘1/1/0001’

  • upper (bool) – Indicates whether the output value should be uppercased.

Returns

The re-formatted value as indicated by ‘format’.

Return type

str, int, datetime

pdm_utils.functions.basic.convert_list_to_dict(data_list, key)

Convert a list of dictionaries to a dictionary of dictionaries.

Parameters
  • data_list (list) – List of dictionaries.

  • key (str) – key in each dictionary to become the returned dictionary key.

Returns

Dictionary of all dictionaries. Returns an empty dictionary if the intended keys are not all unique.

Return type

dict
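
The uniqueness rule can be sketched as follows. This is an illustration of the documented behavior only; `sketch_convert_list_to_dict` is a hypothetical name, and the sketch assumes every dictionary in the list contains the requested key.

```python
def sketch_convert_list_to_dict(data_list, key):
    """Index a list of dicts by one of their keys (hypothetical helper)."""
    result = {}
    for item in data_list:
        new_key = item[key]
        if new_key in result:
            # A repeated key means the intended keys are not unique.
            return {}
        result[new_key] = item
    return result


rows = [{"id": "a", "n": 1}, {"id": "b", "n": 2}]
print(sketch_convert_list_to_dict(rows, "id"))
```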

pdm_utils.functions.basic.convert_to_decoded(values)

Converts a list of utf-8 encoded byte values to decoded strings.

Parameters

values (list[bytes]) – Byte values from MySQL queries to be decoded.

Returns

List of utf-8 decoded values.

Return type

list[str]

pdm_utils.functions.basic.convert_to_encoded(values)

Converts a list of strings to utf-8 encoded values.

Parameters

values (list[str]) – Strings for a MySQL query to be encoded.

Returns

List of utf-8 encoded values.

Return type

list[bytes]
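
The two conversion functions are inverses of each other. A minimal sketch of that round trip, using hypothetical names and only the standard `str.encode` and `bytes.decode` methods (not the package's code):

```python
def sketch_convert_to_encoded(values):
    """Encode each string as utf-8 bytes."""
    return [value.encode("utf-8") for value in values]


def sketch_convert_to_decoded(values):
    """Decode each utf-8 bytes value to a string."""
    return [value.decode("utf-8") for value in values]


encoded = sketch_convert_to_encoded(["Trixie", "L5"])
print(encoded)                             # [b'Trixie', b'L5']
print(sketch_convert_to_decoded(encoded))  # ['Trixie', 'L5']
```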

pdm_utils.functions.basic.create_indices(input_list, batch_size)

Create list of start and stop indices to split a list into batches.

Parameters
  • input_list (list) – List from which to generate batch indices.

  • batch_size (int) – Size of each batch.

Returns

List of 2-element tuples (start index, stop index).

Return type

list
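
One way to read the returned indices is as slice boundaries. The sketch below assumes that the stop index is exclusive and that the final batch may be shorter than `batch_size`; neither detail is stated above. `sketch_create_indices` is a hypothetical name, not the package's function.

```python
def sketch_create_indices(input_list, batch_size):
    """Return (start, stop) pairs suitable for slicing input_list."""
    indices = []
    for start in range(0, len(input_list), batch_size):
        # Clamp the last stop index to the end of the list (assumption).
        stop = min(start + batch_size, len(input_list))
        indices.append((start, stop))
    return indices


data = list("abcdefg")
for start, stop in sketch_create_indices(data, 3):
    print(data[start:stop])
```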

pdm_utils.functions.basic.edit_suffix(value, option, suffix='_Draft')

Adds or removes the indicated suffix to an input value.

Parameters
  • value (str) – Value that will be edited.

  • option (str) – Indicates what to do with the value and suffix (‘add’, ‘remove’).

  • suffix (str) – The suffix that will be added or removed.

Returns

The edited value. The suffix is not added if the input value already has the suffix.

Return type

str
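
The add and remove options can be sketched as below. This is an illustration rather than the package's implementation: `sketch_edit_suffix` is a hypothetical name, and the comparison is assumed to be case-sensitive.

```python
def sketch_edit_suffix(value, option, suffix="_Draft"):
    """Add or remove a suffix (hypothetical helper, case-sensitive)."""
    if option == "add":
        # The suffix is not added if it is already present.
        if not value.endswith(suffix):
            value = value + suffix
    elif option == "remove":
        if suffix and value.endswith(suffix):
            value = value[: -len(suffix)]
    return value


print(sketch_edit_suffix("Trixie", "add"))           # Trixie_Draft
print(sketch_edit_suffix("Trixie_Draft", "add"))     # Trixie_Draft
print(sketch_edit_suffix("Trixie_Draft", "remove"))  # Trixie
```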

pdm_utils.functions.basic.expand_path(input_path)

Convert a non-absolute path into an absolute path.

Parameters

input_path (str) – The path to be expanded.

Returns

The expanded path.

Return type

str

pdm_utils.functions.basic.find_expression(expression, list_of_items)

Counts the number of items with matches to a regular expression.

Parameters
  • expression (re) – Regular expression object.

  • list_of_items (list) – List of items that will be searched with the regular expression.

Returns

Number of times the regular expression was identified in the list.

Return type

int
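
A minimal sketch of the counting logic, assuming that each item is a string and that an item counts once if the pattern is found anywhere in it (`search` rather than `match`). `sketch_find_expression` is a hypothetical name, not the package's function.

```python
import re


def sketch_find_expression(expression, list_of_items):
    """Count the items in which the compiled pattern is found."""
    return sum(1 for item in list_of_items if expression.search(item))


pattern = re.compile(r"\d+")
print(sketch_find_expression(pattern, ["gp1", "tail", "gp22"]))  # 2
```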

pdm_utils.functions.basic.get_user_pwd(user_prompt='Username: ', pwd_prompt='Password: ')

Get username and password.

Parameters
  • user_prompt (str) – Displayed description when prompted for username.

  • pwd_prompt (str) – Displayed description when prompted for password.

Returns

tuple (username, password) WHERE username(str) is the user-supplied username. password(str) is the user-supplied password.

Return type

tuple

pdm_utils.functions.basic.get_values_from_dict_list(list_of_dicts)

Convert a list of dictionaries to a set of the dictionary values.

Parameters

list_of_dicts (list) – List of dictionaries.

Returns

Set of values from all dictionaries in the list.

Return type

set

pdm_utils.functions.basic.get_values_from_tuple_list(list_of_tuples)

Convert a list of tuples to a set of the tuple values.

Parameters

list_of_tuples (list) – List of tuples.

Returns

Set of values from all tuples in the list.

Return type

set

pdm_utils.functions.basic.identify_contents(path_to_folder, kind=None, ignore_set={})

Create a list of filenames and/or folders from an indicated directory.

Parameters
  • path_to_folder (Path) – A valid directory path.

  • kind (str) – (“file”, “dir”), corresponding with paths to be checked as either files or directories.

  • ignore_set (set) – A set of strings representing file or folder names to ignore.

Returns

List of valid contents in the directory.

Return type

list

pdm_utils.functions.basic.identify_nested_items(complete_list)

Identify nested and non-nested two-element tuples in a list.

Parameters

complete_list (list) – List of tuples that will be evaluated.

Returns

tuple (not_nested_set, nested_set) WHERE not_nested_set(set) is a set of non-nested tuples. nested_set(set) is a set of nested tuples.

Return type

tuple

pdm_utils.functions.basic.identify_one_list_duplicates(item_list)

Identify duplicate items within a list.

Parameters

item_list (list) – The input list to be checked.

Returns

The set of non-unique/duplicated items.

Return type

set

pdm_utils.functions.basic.identify_two_list_duplicates(item1_list, item2_list)

Identify duplicate items between two lists.

Parameters
  • item1_list (list) – The first input list to be checked.

  • item2_list (list) – The second input list to be checked.

Returns

The set of non-unique/duplicated items between the two lists (but not duplicate items within each list).

Return type

set

pdm_utils.functions.basic.identify_unique_items(complete_list)

Identify unique and non-unique items in a list.

Parameters

complete_list (list) – List of items that will be evaluated.

Returns

tuple (unique_set, duplicate_set) WHERE unique_set(set) is a set of all unique/non-duplicated items. duplicate_set(set) is a set of non-unique/duplicated items. non-informative/generic data is removed.

Return type

tuple

pdm_utils.functions.basic.increment_histogram(data, histogram)

Increments a dictionary histogram based on given data.

Parameters
  • data (list) – Data to be used to index or create new keys in the histogram.

  • histogram (dict) – Dictionary containing keys whose values contain counts.

pdm_utils.functions.basic.invert_dictionary(dictionary)

Inverts a dictionary, where the values and keys are swapped.

Parameters

dictionary (dict) – A dictionary to be inverted.

Returns

Returns an inverted dictionary of the given dictionary.

Return type

dict

pdm_utils.functions.basic.is_float(string)

Check if string can be converted to float.

pdm_utils.functions.basic.join_strings(input_list, delimiter=' ')

Join a list of values into a single string.

Parameters
  • input_list (list) – List of values to join.

  • delimiter (str) – Delimiter used between values.

Returns

Concatenated values, excluding all None and ‘’ values.

Return type

str
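
The exclusion of None and ‘’ values can be sketched as follows. `sketch_join_strings` is a hypothetical name, and converting the remaining values with `str()` is an assumption made so the sketch accepts non-string input.

```python
def sketch_join_strings(input_list, delimiter=" "):
    """Join values, skipping None and empty strings (hypothetical helper)."""
    kept = [str(value) for value in input_list if value is not None and value != ""]
    return delimiter.join(kept)


print(sketch_join_strings(["tail", None, "", "protein"]))  # tail protein
```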

pdm_utils.functions.basic.lower_case(value)

Return the value lowercased if it is within a specific set of values.

Parameters

value (str) – The value to be checked.

Returns

The lowercased value if it is equivalent to ‘none’, ‘retrieve’, or ‘retain’.

Return type

str

pdm_utils.functions.basic.make_new_dir(output_dir, new_dir, attempt=1, mkdir=True)

Make a new directory.

Checks to verify the new directory name is valid and does not already exist. If it already exists, it attempts to extend the name with an integer suffix.

Parameters
  • output_dir (Path) – Full path to the directory where the new directory will be created.

  • new_dir (Path) – Name of the new directory to be created.

  • attempt (int) – Number of attempts to create the directory.

Returns

If successful, the full path of the created directory. If unsuccessful, None.

Return type

Path, None

pdm_utils.functions.basic.make_new_file(output_dir, new_file, ext, attempt=1)

Make a new file.

Checks to verify the new file name is valid and does not already exist. If it already exists, it attempts to extend the name with an integer suffix.

Parameters
  • output_dir (Path) – Full path to the directory where the new file will be created.

  • new_file (Path) – Name of the new file to be created.

  • ext (str) – Name of the file extension to be used.

  • attempt (int) – Number of attempts to create the file.

Returns

If successful, the full path of the created file. If unsuccessful, None.

Return type

Path, None

pdm_utils.functions.basic.match_items(list1, list2)

Match values of two lists and return several results.

Parameters
  • list1 (list) – The first input list.

  • list2 (list) – The second input list.

Returns

tuple (matched_unique_items, set1_unmatched_unique_items, set2_unmatched_unique_items, set1_duplicate_items, set2_duplicate_items) WHERE matched_unique_items(set) is the set of matched unique values. set1_unmatched_unique_items(set) is the set of unmatched unique values from the first list. set2_unmatched_unique_items(set) is the set of unmatched unique values from the second list. set1_duplicate_items(set) is the set of duplicate values from the first list. set2_duplicate_items(set) is the set of duplicate values from the second list.

Return type

tuple
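
The five returned sets can be derived by first separating unique from duplicated values in each list and then comparing the unique values. The sketch below follows that reading; it is not the package's code, and `sketch_match_items` is a hypothetical name.

```python
from collections import Counter


def sketch_match_items(list1, list2):
    """Return matched, unmatched, and duplicated values for two lists."""
    counts1, counts2 = Counter(list1), Counter(list2)
    unique1 = {item for item, n in counts1.items() if n == 1}
    unique2 = {item for item, n in counts2.items() if n == 1}
    duplicates1 = {item for item, n in counts1.items() if n > 1}
    duplicates2 = {item for item, n in counts2.items() if n > 1}
    return (
        unique1 & unique2,  # matched unique values
        unique1 - unique2,  # unmatched unique values from the first list
        unique2 - unique1,  # unmatched unique values from the second list
        duplicates1,
        duplicates2,
    )


result = sketch_match_items(["a", "b", "b", "c"], ["a", "d", "d", "e"])
print([sorted(group) for group in result])  # [['a'], ['c'], ['e'], ['b'], ['d']]
```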

pdm_utils.functions.basic.merge_set_dicts(dict1, dict2)

Merge two dictionaries of sets.

Parameters
  • dict1 (dict) – First dictionary of sets.

  • dict2 (dict) – Second dictionary of sets.

Returns

Merged dictionary containing all keys from both dictionaries, and for each shared key the value is a set of merged values.

Return type

dict
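
A minimal sketch of the merge, using the hypothetical name `sketch_merge_set_dicts`. It builds new sets so that neither input dictionary is modified, which is a design choice of the sketch and not a documented property of the package function.

```python
def sketch_merge_set_dicts(dict1, dict2):
    """Merge two dicts of sets, taking the union for shared keys."""
    merged = {key: set(values) for key, values in dict1.items()}
    for key, values in dict2.items():
        merged.setdefault(key, set()).update(values)
    return merged


first = {"A": {1, 2}, "B": {3}}
second = {"B": {4}, "C": {5}}
print(sketch_merge_set_dicts(first, second))
```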

pdm_utils.functions.basic.parse_flag_file(flag_file)

Parse a file to an evaluation flag dictionary.

Parameters

flag_file (str) – A two-column csv-formatted file WHERE column 1 is the evaluation flag and column 2 is ‘True’ or ‘False’.

Returns

A dictionary WHERE keys (str) are evaluation flags and values (bool) indicate the flag setting. Only flags that contain boolean values are returned.

Return type

dict

pdm_utils.functions.basic.parse_names_from_record_field(description)

Attempts to parse the phage/plasmid/prophage name and host genus from a given string.

Parameters

description (str) – The input string to be parsed.

Returns

name, host_genus

pdm_utils.functions.basic.partition_list(data_list, size)

Chunks list into a list of lists with the given size.

Parameters
  • data_list (list) – List to be split into equal-sized lists.

  • size (int) – Length of the resulting list chunks.

Returns

Returns list of lists with length of the given size.

Return type

list[list]
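
Chunking by slices can be sketched as follows. `sketch_partition_list` is a hypothetical name, and the sketch assumes that a final chunk shorter than `size` is kept when the list does not divide evenly.

```python
def sketch_partition_list(data_list, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [data_list[i:i + size] for i in range(0, len(data_list), size)]


print(sketch_partition_list([1, 2, 3, 4, 5], 2))  # [[1, 2], [3, 4], [5]]
```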

pdm_utils.functions.basic.prepare_filepath(folder_path, file_name, folder_name=None)

Prepare path to new file.

Parameters
  • folder_path (Path) – Path to the directory to contain the file.

  • file_name (str) – Name of the file.

  • folder_name (Path) – Name of sub-directory to create.

Returns

Path to file in directory.

Return type

Path

pdm_utils.functions.basic.reformat_coordinates(start, stop, current, new)

Converts common coordinate formats.

The type of coordinate formats include:

‘0_half_open’:

0-based half-open intervals, which are the common format for BAM files and the UCSC Browser database. This format seems to be more efficient when performing genomics computations.

‘1_closed’:

1-based closed intervals, which are the common format for the MySQL database, the UCSC Browser, the Ensembl genomics database, VCF files, and GFF files. This format seems to be more intuitive and is used for visualization.

The function assumes coordinates reflect the start and stop boundaries (where the start coordinate is smaller than the stop coordinate), instead of transcription start and stop coordinates.

Parameters
  • start (int) – Start coordinate

  • stop (int) – Stop coordinate

  • current (str) – Indicates the indexing format of the input coordinates.

  • new (str) – Indicates the indexing format of the output coordinates.

Returns

The re-formatted start and stop coordinates.

Return type

int
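
Converting between the two formats only shifts the start coordinate by one, since the closed stop of a 1-based interval equals the open stop of the corresponding 0-based interval. For example, the 1-based closed interval (5, 10) is the 0-based half-open interval (4, 10). The sketch below illustrates this arithmetic; `sketch_reformat_coordinates` is a hypothetical name, and raising `ValueError` for unknown formats is an assumption, not documented behavior.

```python
VALID_FORMATS = {"0_half_open", "1_closed"}


def sketch_reformat_coordinates(start, stop, current, new):
    """Convert (start, stop) between '0_half_open' and '1_closed'."""
    if current not in VALID_FORMATS or new not in VALID_FORMATS:
        # Assumed error handling for the sketch only.
        raise ValueError("unsupported coordinate format")
    if current == new:
        return start, stop
    if current == "1_closed":
        # 1-based closed -> 0-based half-open
        return start - 1, stop
    # 0-based half-open -> 1-based closed
    return start + 1, stop


print(sketch_reformat_coordinates(5, 10, "1_closed", "0_half_open"))  # (4, 10)
print(sketch_reformat_coordinates(4, 10, "0_half_open", "1_closed"))  # (5, 10)
```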

pdm_utils.functions.basic.reformat_description(raw_description)

Reformat a gene description.

Parameters

raw_description (str) – Input value to be reformatted.

Returns

tuple (description, processed_description) WHERE description(str) is the original value stripped of leading and trailing whitespace. processed_description(str) is the reformatted value, in which non-informative/generic data is removed.

Return type

tuple

pdm_utils.functions.basic.reformat_strand(input_value, format, case=False)

Converts common strand orientation formats.

Parameters
  • input_value (str, int) – Value that will be edited.

  • format (str) – Indicates how the value should be edited. Valid format types include:

      ‘fr_long’ (‘forward’, ‘reverse’)
      ‘fr_short’ (‘f’, ‘r’)
      ‘fr_abbrev1’ (‘for’, ‘rev’)
      ‘fr_abbrev2’ (‘fwd’, ‘rev’)
      ‘tb_long’ (‘top’, ‘bottom’)
      ‘tb_short’ (‘t’, ‘b’)
      ‘wc_long’ (‘watson’, ‘crick’)
      ‘wc_short’ (‘w’, ‘c’)
      ‘operator’ (‘+’, ‘-’)
      ‘numeric’ (1, -1)

  • case (bool) – Indicates whether the output value should be capitalized.

Returns

The re-formatted value as indicated by ‘format’.

Return type

str, int
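
The format table lends itself to a lookup-based conversion. The sketch below is one possible reading, not the package's implementation: `sketch_reformat_strand` and `STRAND_FORMATS` are hypothetical names, the parameter is called `fmt` to avoid shadowing the Python built-in, string input is assumed to be matched case-insensitively, and raising `ValueError` for unrecognized input is an assumption.

```python
STRAND_FORMATS = {
    "fr_long": ("forward", "reverse"),
    "fr_short": ("f", "r"),
    "fr_abbrev1": ("for", "rev"),
    "fr_abbrev2": ("fwd", "rev"),
    "tb_long": ("top", "bottom"),
    "tb_short": ("t", "b"),
    "wc_long": ("watson", "crick"),
    "wc_short": ("w", "c"),
    "operator": ("+", "-"),
    "numeric": (1, -1),
}


def sketch_reformat_strand(input_value, fmt, case=False):
    """Convert a strand value to the requested format (hypothetical helper)."""
    target = STRAND_FORMATS[fmt]
    key = input_value.lower() if isinstance(input_value, str) else input_value
    for first, second in STRAND_FORMATS.values():
        if key == first:
            new_value = target[0]
            break
        if key == second:
            new_value = target[1]
            break
    else:
        # Assumed error handling for the sketch only.
        raise ValueError("unrecognized strand value")
    if case and isinstance(new_value, str):
        new_value = new_value.capitalize()
    return new_value


print(sketch_reformat_strand("Forward", "numeric"))        # 1
print(sketch_reformat_strand(-1, "fr_short", case=True))   # R
```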

pdm_utils.functions.basic.select_option(prompt, valid_response_set)

Select an option from a set of options.

Parameters
  • prompt (str) – Message to display before displaying option.

  • valid_response_set (set) – Set of valid options to choose.

Returns

option

Return type

str, int

pdm_utils.functions.basic.set_path(path, kind=None, expect=True)

Confirm validity of path argument.

Parameters
  • path (Path) – path

  • kind (str) – (“file”, “dir”), corresponding with paths to be checked as either files or directories.

  • expect (bool) – Indicates if the path is expected to be the indicated kind.

Returns

Absolute path if valid, otherwise sys.exit is called.

Return type

Path

pdm_utils.functions.basic.sort_histogram(histogram, descending=True)

Sorts a dictionary by its values and returns the sorted histogram.

Parameters

histogram (dict) – Dictionary containing keys whose values contain counts.

Returns

An ordered dict of items from the histogram, sorted by value.

Return type

OrderedDict

pdm_utils.functions.basic.sort_histogram_keys(histogram, descending=True)

Sorts a dictionary by its values and returns the sorted keys.

Parameters

histogram (dict) – Dictionary containing keys whose values contain counts.

Returns

A list of keys from the histogram, sorted by value.

Return type

list
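
The two histogram sorting functions differ only in what they return. A minimal sketch of both, using hypothetical names and assuming that `descending=True` places the largest counts first:

```python
from collections import OrderedDict


def sketch_sort_histogram(histogram, descending=True):
    """Return an OrderedDict of the items sorted by count."""
    items = sorted(histogram.items(), key=lambda item: item[1], reverse=descending)
    return OrderedDict(items)


def sketch_sort_histogram_keys(histogram, descending=True):
    """Return only the keys, sorted by count."""
    return list(sketch_sort_histogram(histogram, descending))


counts = {"A": 1, "B": 3, "C": 2}
print(sketch_sort_histogram_keys(counts))                    # ['B', 'C', 'A']
print(sketch_sort_histogram_keys(counts, descending=False))  # ['A', 'C', 'B']
```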

pdm_utils.functions.basic.split_string(string)

Split a string based on alphanumeric characters.

Iterates through a string, identifies the first position at which the character is numeric, and splits the string into two strings at this position.

Parameters

string (str) – The value to be split.

Returns

tuple (left, right) WHERE left(str) is the left portion of the input value prior to the first numeric character and only contains alphabetic characters (or will be ‘’). right(str) is the right portion of the input value after the first numeric character and only contains numeric characters (or will be ‘’).

Return type

tuple
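
The split can be sketched with a regular expression. This is an interpretation of the description above, not the package's code: `sketch_split_string` is a hypothetical name, only ASCII letters followed by digits are accepted, and any other layout (for example letters after the digits) is assumed to yield two empty strings.

```python
import re


def sketch_split_string(string):
    """Split 'letters then digits' into (left, right) (hypothetical helper)."""
    match = re.fullmatch(r"([A-Za-z]*)(\d*)", string)
    if match is None:
        # Input does not fit the alphabetic-then-numeric layout (assumption).
        return "", ""
    return match.group(1), match.group(2)


print(sketch_split_string("A15"))  # ('A', '15')
print(sketch_split_string("ABC"))  # ('ABC', '')
```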

pdm_utils.functions.basic.trim_characters(string)

Remove leading and trailing generic characters from a string.

Parameters

string (str) – Value that will be trimmed. Characters that will be removed include: ‘.’, ‘,’, ‘;’, ‘-’, ‘_’.

Returns

Edited value.

Return type

str

pdm_utils.functions.basic.truncate_value(value, length, suffix)

Truncate a string.

Parameters
  • value (str) – String that should be truncated.

  • length (int) – Final length of truncated string.

  • suffix (str) – String that should be appended to truncated string.

Returns

The truncated string.

Return type

str

pdm_utils.functions.basic.verify_path(filepath, kind=None)

Verifies that a given path exists.

Parameters
  • filepath (str) – Full path to the desired file/directory.

  • kind (str) – (“file”, “dir”), corresponding with paths to be checked as either files or directories.

Returns

True if path is verified, False otherwise.

Return type

bool

pdm_utils.functions.basic.verify_path2(path, kind=None, expect=True)

Verifies that a given path exists.

Parameters
  • path (Path) – path

  • kind (str) – (“file”, “dir”), corresponding with paths to be checked as either files or directories.

  • expect (bool) – Indicates if the path is expected to be the indicated kind.

Returns

tuple (result, message) WHERE result(bool) indicates if the expectation was satisfied. message(str) is a description of the result.

Return type

tuple