basic¶

Miscellaneous basic functions. To prevent circular imports, these functions should not require importing other modules in this package.

pdm_utils.functions.basic.ask_yes_no(prompt='', response_attempt=1)¶

Function to get the user’s yes/no response to a question.

Accepts variations of yes/y, true/t, no/n, false/f, exit/quit/q.

Parameters

prompt (str) – the question to ask the user.
response_attempt (int) – The number of attempts allowed before the function exits. This prevents the script from getting stuck in a loop.

Returns

The default is False (e.g. user hits Enter without typing anything else), but variations of yes or true responses will return True instead. If the response is ‘exit’ or ‘quit’, the loop is exited and None is returned.

Return type

bool, None

pdm_utils.functions.basic.check_empty(value, lower=True)¶

Checks if the value represents a null value.

Parameters

value (misc.) – Value to be checked against the empty set.
lower (bool) – Indicates whether the input value should be lowercased prior to checking.

Returns

Indicates whether the value is present in the empty set.

Return type

bool

pdm_utils.functions.basic.check_value_expected_in_set(value, set1, expect=True)¶

Check if a value is present within a set and if it is expected.

Parameters

value (misc.) – The value to be checked.
set1 (set) – The reference set of values.
expect (bool) – Indicates if ‘value’ is expected to be present in ‘set1’.

Returns

The result of the evaluation.

Return type

bool
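
The evaluation described above amounts to comparing set membership against the expectation. A minimal sketch follows; the name `value_matches_expectation` is hypothetical and this is an illustration, not the pdm_utils source.

```python
def value_matches_expectation(value, set1, expect=True):
    """Hypothetical stand-in: True when membership agrees with the expectation."""
    # Membership and expectation must be both True or both False.
    return (value in set1) == expect


print(value_matches_expectation("A", {"A", "B"}, expect=True))   # True
print(value_matches_expectation("A", {"A", "B"}, expect=False))  # False
```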

pdm_utils.functions.basic.check_value_in_two_sets(value, set1, set2)¶

Check if a value is present within two sets.

Parameters

value (misc.) – The value to be checked.
set1 (set) – The first reference set of values.
set2 (set) – The second reference set of values.

Returns

The result of the evaluation, indicating whether the value is present within:

only the ‘first’ set

only the ‘second’ set

’both’ sets

’neither’ set

Return type

str
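
The four possible results can be illustrated with a short stand-alone sketch. The function name `classify_membership` is hypothetical, and the code only mirrors the documented return values rather than reproducing the library's implementation.

```python
def classify_membership(value, set1, set2):
    """Hypothetical stand-in: report which of two sets contain the value."""
    in_first = value in set1
    in_second = value in set2
    if in_first and in_second:
        return "both"
    if in_first:
        return "first"
    if in_second:
        return "second"
    return "neither"


print(classify_membership("A", {"A", "B"}, {"B", "C"}))  # first
print(classify_membership("B", {"A", "B"}, {"B", "C"}))  # both
```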

pdm_utils.functions.basic.choose_from_list(options)¶

Iterate through a list of values and choose a value.

Parameters: options (list) – List of options to choose from.
Returns: The option selected by the user, or None.
Return type: option or None

pdm_utils.functions.basic.choose_most_common(string, values)¶

Identify most common occurrence of several values in a string.

Parameters

string (str) – String to search.
values (list) – List of string characters. The order in the list indicates preference, in the case of a tie.

Returns

Value from values that occurs most.

Return type

str

pdm_utils.functions.basic.clear_screen()¶

Brings the command line to the top of the screen.

pdm_utils.functions.basic.compare_cluster_subcluster(cluster, subcluster)¶

Check if a cluster and subcluster designation are compatible.

Parameters

cluster (str) – The cluster value to be compared. ‘Singleton’ and ‘UNK’ are lowercased.
subcluster (str) – The subcluster value to be compared.

Returns

The result of the evaluation, indicating whether the two values are compatible.

Return type

bool

pdm_utils.functions.basic.compare_sets(set1, set2)¶

Compute the intersection and differences between two sets.

Parameters

set1 (set) – The first input set.
set2 (set) – The second input set.

Returns

tuple (set_intersection, set1_diff, set2_diff) WHERE set_intersection(set) is the set of shared values. set1_diff(set) is the set of values unique to the first set. set2_diff(set) is the set of values unique to the second set.

Return type

tuple
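
The three returned sets correspond to Python's built-in set operators. The sketch below assumes nothing beyond the documented return tuple; `compare_sets_sketch` is a hypothetical name used for illustration.

```python
def compare_sets_sketch(set1, set2):
    """Hypothetical stand-in returning (intersection, set1-only, set2-only)."""
    set_intersection = set1 & set2  # values shared by both sets
    set1_diff = set1 - set2         # values unique to the first set
    set2_diff = set2 - set1         # values unique to the second set
    return set_intersection, set1_diff, set2_diff


shared, only_first, only_second = compare_sets_sketch({1, 2, 3}, {3, 4})
print(shared, only_first, only_second)
```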

pdm_utils.functions.basic.convert_empty(input_value, format, upper=False)¶

Converts common null value formats.

Parameters

input_value (str, int, datetime) – Value to be re-formatted.
format (str) – Indicates how the value should be edited. Valid format types include: ‘empty_string’ = ‘’; ‘none_string’ = ‘none’; ‘null_string’ = ‘null’; ‘none_object’ = None; ‘na_long’ = ‘not applicable’; ‘na_short’ = ‘na’; ‘n/a’ = ‘n/a’; ‘zero_string’ = ‘0’; ‘zero_num’ = 0; ‘empty_datetime_obj’ = datetime object with arbitrary date, ‘1/1/0001’.
upper (bool) – Indicates whether the output value should be uppercased.

Returns

The re-formatted value as indicated by ‘format’.

Return type

str, int, datetime

pdm_utils.functions.basic.convert_list_to_dict(data_list, key)¶

Convert a list of dictionaries to a dictionary of dictionaries.

Parameters

data_list (list) – List of dictionaries.
key (str) – key in each dictionary to become the returned dictionary key.

Returns

Dictionary of all dictionaries. Returns an empty dictionary if the intended keys are not all unique.

Return type

dict
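
A rough sketch of the re-keying and the empty-dictionary fallback is shown below. The name `list_to_dict_sketch` and the sample records are hypothetical, and the sketch assumes every dictionary contains the requested key.

```python
def list_to_dict_sketch(data_list, key):
    """Hypothetical stand-in: re-key a list of dictionaries by one field."""
    result = {}
    for entry in data_list:
        new_key = entry[key]
        if new_key in result:
            # A repeated key means the mapping is ambiguous; return {}.
            return {}
        result[new_key] = entry
    return result


records = [{"id": "L5", "cluster": "A"}, {"id": "Trixie", "cluster": "A"}]
print(list_to_dict_sketch(records, "id"))
print(list_to_dict_sketch(records, "cluster"))  # {}
```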

pdm_utils.functions.basic.convert_to_decoded(values)¶

Converts a list of utf-8 encoded byte values to decoded strings.

Parameters: values (list[bytes]) – Byte values from MySQL queries to be decoded.
Returns: List of utf-8 decoded values.
Return type: list[str]

pdm_utils.functions.basic.convert_to_encoded(values)¶

Converts a list of strings to utf-8 encoded values.

Parameters: values (list[str]) – Strings for a MySQL query to be encoded.
Returns: List of utf-8 encoded values.
Return type: list[bytes]
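
The two conversion functions are inverses of each other. The following sketch illustrates the round trip using only `str.encode` and `bytes.decode`; the function names are hypothetical and the code is not taken from pdm_utils.

```python
def encode_values(values):
    """Hypothetical stand-in: list[str] -> list[bytes] using utf-8."""
    return [value.encode("utf-8") for value in values]


def decode_values(values):
    """Hypothetical stand-in: list[bytes] -> list[str] using utf-8."""
    return [value.decode("utf-8") for value in values]


encoded = encode_values(["Trixie", "L5"])
print(encoded)                 # [b'Trixie', b'L5']
print(decode_values(encoded))  # ['Trixie', 'L5']
```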

pdm_utils.functions.basic.create_indices(input_list, batch_size)¶

Create list of start and stop indices to split a list into batches.

Parameters

input_list (list) – List from which to generate batch indices.
batch_size (int) – Size of each batch.

Returns

List of 2-element tuples (start index, stop index).

Return type

list
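
One way to picture the returned indices is the sketch below. It assumes the stop index is exclusive (so each tuple can be used directly as a slice) and that the final batch may be shorter than `batch_size`; both are assumptions, and `batch_indices` is a hypothetical name.

```python
def batch_indices(input_list, batch_size):
    """Hypothetical stand-in: (start, stop) pairs with an exclusive stop."""
    return [
        (start, min(start + batch_size, len(input_list)))
        for start in range(0, len(input_list), batch_size)
    ]


items = list("abcdefg")
for start, stop in batch_indices(items, 3):
    # Each pair slices out one batch of the original list.
    print(start, stop, items[start:stop])
```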

pdm_utils.functions.basic.edit_suffix(value, option, suffix='_Draft')¶

Adds or removes the indicated suffix to an input value.

Parameters

value (str) – Value that will be edited.
option (str) – Indicates what to do with the value and suffix (‘add’, ‘remove’).
suffix (str) – The suffix that will be added or removed.

Returns

The edited value. The suffix is not added if the input value already has the suffix.

Return type

str
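
The add and remove options can be sketched as follows. The sketch assumes case-sensitive suffix matching and leaves the value unchanged for any other option; neither detail is stated above, and `edit_suffix_sketch` is a hypothetical name.

```python
def edit_suffix_sketch(value, option, suffix="_Draft"):
    """Hypothetical stand-in: add or remove a suffix without duplicating it."""
    has_suffix = bool(suffix) and value.endswith(suffix)
    if option == "add" and not has_suffix:
        value = value + suffix
    elif option == "remove" and has_suffix:
        value = value[: -len(suffix)]
    return value


print(edit_suffix_sketch("Trixie", "add"))           # Trixie_Draft
print(edit_suffix_sketch("Trixie_Draft", "add"))     # Trixie_Draft
print(edit_suffix_sketch("Trixie_Draft", "remove"))  # Trixie
```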

pdm_utils.functions.basic.expand_path(input_path)¶

Convert a non-absolute path into an absolute path.

Parameters: input_path (str) – The path to be expanded.
Returns: The expanded path.
Return type: str

pdm_utils.functions.basic.find_expression(expression, list_of_items)¶

Counts the number of items with matches to a regular expression.

Parameters

expression (re) – Regular expression object.
list_of_items (list) – List of items that will be searched with the regular expression.

Returns

Number of times the regular expression was identified in the list.

Return type

int

pdm_utils.functions.basic.get_user_pwd(user_prompt='Username: ', pwd_prompt='Password: ')¶

Get username and password.

Parameters

user_prompt (str) – Displayed description when prompted for username.
pwd_prompt (str) – Displayed description when prompted for password.

Returns

tuple (username, password) WHERE username(str) is the user-supplied username. password(str) is the user-supplied password.

Return type

tuple

pdm_utils.functions.basic.get_values_from_dict_list(list_of_dicts)¶

Convert a list of dictionaries to a set of the dictionary values.

Parameters: list_of_dicts (list) – List of dictionaries.
Returns: Set of values from all dictionaries in the list.
Return type: set

pdm_utils.functions.basic.get_values_from_tuple_list(list_of_tuples)¶

Convert a list of tuples to a set of the tuple values.

Parameters: list_of_tuples (list) – List of tuples.
Returns: Set of values from all tuples in the list.
Return type: set

pdm_utils.functions.basic.identify_contents(path_to_folder, kind=None, ignore_set={})¶

Create a list of filenames and/or folders from an indicated directory.

Parameters

path_to_folder (Path) – A valid directory path.
kind (str) – (“file”, “dir”), corresponding with paths to be checked as either files or directories.
ignore_set (set) – A set of strings representing file or folder names to ignore.

Returns

List of valid contents in the directory.

Return type

list

pdm_utils.functions.basic.identify_nested_items(complete_list)¶

Identify nested and non-nested two-element tuples in a list.

Parameters: complete_list (list) – List of tuples that will be evaluated.
Returns: tuple (not_nested_set, nested_set) WHERE not_nested_set(set) is a set of non-nested tuples. nested_set(set) is a set of nested tuples.
Return type: tuple

pdm_utils.functions.basic.identify_one_list_duplicates(item_list)¶

Identify duplicate items within a list.

Parameters: item_list (list) – The input list to be checked.
Returns: The set of non-unique/duplicated items.
Return type: set

pdm_utils.functions.basic.identify_two_list_duplicates(item1_list, item2_list)¶

Identify duplicate items between two lists.

Parameters

item1_list (list) – The first input list to be checked.
item2_list (list) – The second input list to be checked.

Returns

The set of non-unique/duplicated items between the two lists (but not duplicate items within each list).

Return type

set

pdm_utils.functions.basic.identify_unique_items(complete_list)¶

Identify unique and non-unique items in a list.

Parameters: complete_list (list) – List of items that will be evaluated.
Returns: tuple (unique_set, duplicate_set) WHERE unique_set(set) is a set of all unique/non-duplicated items. duplicate_set(set) is a set of non-unique/duplicated items. non-informative/generic data is removed.
Return type: tuple
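
The split into unique and duplicated items can be illustrated with `collections.Counter`. This is a sketch under the assumption that list items are hashable; the name `split_unique_and_duplicates` is hypothetical.

```python
from collections import Counter


def split_unique_and_duplicates(complete_list):
    """Hypothetical stand-in returning (unique_set, duplicate_set)."""
    counts = Counter(complete_list)
    unique_set = {item for item, n in counts.items() if n == 1}
    duplicate_set = {item for item, n in counts.items() if n > 1}
    return unique_set, duplicate_set


unique_set, duplicate_set = split_unique_and_duplicates(["a", "b", "b", "c"])
print(unique_set, duplicate_set)
```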

pdm_utils.functions.basic.increment_histogram(data, histogram)¶

Increments a dictionary histogram based on given data.

Parameters

data (list) – Data to be used to index or create new keys in the histogram.
histogram (dict) – Dictionary containing keys whose values contain counts.
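
Because no return value is documented, the sketch below assumes the histogram is updated in place, with unseen items starting from a count of zero. `increment_histogram_sketch` is a hypothetical name, not the library function.

```python
def increment_histogram_sketch(data, histogram):
    """Hypothetical stand-in: count each item of data into histogram in place."""
    for item in data:
        # Existing keys are incremented; new keys start at 1.
        histogram[item] = histogram.get(item, 0) + 1


counts = {"A": 1}
increment_histogram_sketch(["A", "B", "A"], counts)
print(counts)  # {'A': 3, 'B': 1}
```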

pdm_utils.functions.basic.invert_dictionary(dictionary)¶

Inverts a dictionary, where the values and keys are swapped.

Parameters: dictionary (dict) – A dictionary to be inverted.
Returns: The inverted dictionary.
Return type: dict

pdm_utils.functions.basic.is_float(string)¶

Check if string can be converted to float.

pdm_utils.functions.basic.join_strings(input_list, delimiter=' ')¶

Join a list of values into a single string.

Parameters

input_list (list) – List of values to join.
delimiter (str) – Delimiter used between values.

Returns

Concatenated values, excluding all None and ‘’ values.

Return type

str
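
The filtering of None and empty strings can be sketched as below. The sketch assumes all remaining values are already strings, and `join_strings_sketch` is a hypothetical name.

```python
def join_strings_sketch(input_list, delimiter=" "):
    """Hypothetical stand-in: join values, skipping None and '' entries."""
    kept = [value for value in input_list if value is not None and value != ""]
    return delimiter.join(kept)


print(join_strings_sketch(["Mycobacterium", None, "", "phage"]))  # Mycobacterium phage
```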

pdm_utils.functions.basic.lower_case(value)¶

Return the value lowercased if it is within a specific set of values.

Parameters: value (str) – The value to be checked.
Returns: The lowercased value if it is equivalent to ‘none’, ‘retrieve’, or ‘retain’.
Return type: str

pdm_utils.functions.basic.make_new_dir(output_dir, new_dir, attempt=1, mkdir=True)¶

Make a new directory.

Checks to verify the new directory name is valid and does not already exist. If it already exists, it attempts to extend the name with an integer suffix.

Parameters

output_dir (Path) – Full path to the directory where the new directory will be created.
new_dir (Path) – Name of the new directory to be created.
attempt (int) – Number of attempts to create the directory.

Returns

If successful, the full path of the created directory. If unsuccessful, None.

Return type

Path, None

pdm_utils.functions.basic.make_new_file(output_dir, new_file, ext, attempt=1)¶

Make a new file.

Checks to verify the new file name is valid and does not already exist. If it already exists, it attempts to extend the name with an integer suffix.

Parameters

output_dir (Path) – Full path to the directory where the new file will be created.
new_file (Path) – Name of the new file to be created.
ext (str) – Name of the file extension to be used.
attempt (int) – Number of attempts to create the file.

Returns

If successful, the full path of the created file. If unsuccessful, None.

Return type

Path, None

pdm_utils.functions.basic.match_items(list1, list2)¶

Match values of two lists and return several results.

Parameters

list1 (list) – The first input list.
list2 (list) – The second input list.

Returns

tuple (matched_unique_items, set1_unmatched_unique_items, set2_unmatched_unique_items, set1_duplicate_items, set2_duplicate_items) WHERE matched_unique_items(set) is the set of matched unique values. set1_unmatched_unique_items(set) is the set of unmatched unique values from the first list. set2_unmatched_unique_items(set) is the set of unmatched unique values from the second list. set1_duplicate_items(set) is the set of duplicate values from the first list. set2_duplicate_items(set) is the set of duplicate values from the second list.

Return type

tuple

pdm_utils.functions.basic.merge_set_dicts(dict1, dict2)¶

Merge two dictionaries of sets.

Parameters

dict1 (dict) – First dictionary of sets.
dict2 (dict) – Second dictionary of sets.

Returns

Merged dictionary containing all keys from both dictionaries, and for each shared key the value is a set of merged values.

Return type

dict
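
A minimal sketch of the merge is given below. It assumes the inputs should not be modified, so sets are copied into a new dictionary; that choice and the name `merge_set_dicts_sketch` are assumptions for illustration.

```python
def merge_set_dicts_sketch(dict1, dict2):
    """Hypothetical stand-in: union the sets stored under each key."""
    # Copy the first dictionary's sets so the inputs stay untouched.
    merged = {key: set(values) for key, values in dict1.items()}
    for key, values in dict2.items():
        merged.setdefault(key, set()).update(values)
    return merged


first = {"A": {"L5"}, "B": {"Rosebush"}}
second = {"B": {"Qyrzula"}, "C": {"Myrna"}}
print(merge_set_dicts_sketch(first, second))
```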

pdm_utils.functions.basic.parse_flag_file(flag_file)¶

Parse a file to an evaluation flag dictionary.

Parameters: flag_file (str) – A two-column csv-formatted file WHERE column 1 is the evaluation flag and column 2 is ‘True’ or ‘False’.
Returns: A dictionary WHERE keys (str) are evaluation flags and values (bool) indicate the flag setting. Only flags that contain boolean values are returned.
Return type: dict

pdm_utils.functions.basic.parse_names_from_record_field(description)¶

Attempts to parse the phage/plasmid/prophage name and host genus from a given string.

Parameters: description (str) – The input string to be parsed.
Returns: name, host_genus

pdm_utils.functions.basic.partition_list(data_list, size)¶

Chunks list into a list of lists with the given size.

Parameters

data_list (list) – List to be split into equal-sized lists.
size (int) – Length of the resulting list chunks.

Returns

Returns list of lists with length of the given size.

Return type

list[list]
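
Chunking by slicing can be sketched as follows. The sketch assumes the last chunk is simply shorter when the list length is not a multiple of `size`; `partition_list_sketch` is a hypothetical name.

```python
def partition_list_sketch(data_list, size):
    """Hypothetical stand-in: split a list into consecutive chunks."""
    return [data_list[i:i + size] for i in range(0, len(data_list), size)]


print(partition_list_sketch([1, 2, 3, 4, 5], 2))  # [[1, 2], [3, 4], [5]]
```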

pdm_utils.functions.basic.prepare_filepath(folder_path, file_name, folder_name=None)¶

Prepare path to new file.

Parameters

folder_path (Path) – Path to the directory to contain the file.
file_name (str) – Name of the file.
folder_name (Path) – Name of sub-directory to create.

Returns

Path to file in directory.

Return type

Path

pdm_utils.functions.basic.reformat_coordinates(start, stop, current, new)¶

Converts common coordinate formats.

The type of coordinate formats include:

‘0_half_open’:

0-based half-open intervals, the common format for BAM files and the UCSC Browser database. This format seems to be more efficient when performing genomics computations.

‘1_closed’:

1-based closed intervals, the common format for the MySQL Database, UCSC Browser, the Ensembl genomics database, VCF files, and GFF files. This format seems to be more intuitive and is used for visualization.

The function assumes coordinates reflect the start and stop boundaries (where the start coordinate is smaller than the stop coordinate), instead of transcription start and stop coordinates.

Parameters

start (int) – Start coordinate.
stop (int) – Stop coordinate.
current (str) – Indicates the indexing format of the input coordinates.
new (str) – Indicates the indexing format of the output coordinates.

Returns

The re-formatted start and stop coordinates.

Return type

int
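
The two formats describe the same interval and differ only in the start value: a 0-based half-open interval [start, stop) covers the same positions as the 1-based closed interval [start + 1, stop]. The sketch below illustrates that arithmetic. The name `reformat_coordinates_sketch` is hypothetical, and raising ValueError for an unknown format is an assumption of this sketch, not documented behavior.

```python
def reformat_coordinates_sketch(start, stop, current, new):
    """Hypothetical stand-in converting between the two interval formats."""
    formats = {"0_half_open", "1_closed"}
    if current not in formats or new not in formats:
        # Assumption: reject unknown formats instead of guessing.
        raise ValueError("unsupported coordinate format")
    if current == "0_half_open" and new == "1_closed":
        start += 1  # first position moves from index 0 to index 1
    elif current == "1_closed" and new == "0_half_open":
        start -= 1
    # The stop value is numerically the same in both formats.
    return start, stop


print(reformat_coordinates_sketch(0, 10, "0_half_open", "1_closed"))  # (1, 10)
print(reformat_coordinates_sketch(1, 10, "1_closed", "0_half_open"))  # (0, 10)
```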

pdm_utils.functions.basic.reformat_description(raw_description)¶

Reformat a gene description.

Parameters: raw_description (str) – Input value to be reformatted.
Returns: tuple (description, processed_description) WHERE description(str) is the original value stripped of leading and trailing whitespace. processed_description(str) is the reformatted value, in which non-informative/generic data is removed.
Return type: tuple

pdm_utils.functions.basic.reformat_strand(input_value, format, case=False)¶

Converts common strand orientation formats.

Parameters

input_value (str, int) – Value that will be edited.
format (str) – Indicates how the value should be edited. Valid format types include: ‘fr_long’ (‘forward’, ‘reverse’); ‘fr_short’ (‘f’, ‘r’); ‘fr_abbrev1’ (‘for’, ‘rev’); ‘fr_abbrev2’ (‘fwd’, ‘rev’); ‘tb_long’ (‘top’, ‘bottom’); ‘tb_short’ (‘t’, ‘b’); ‘wc_long’ (‘watson’, ‘crick’); ‘wc_short’ (‘w’, ‘c’); ‘operator’ (‘+’, ‘-’); ‘numeric’ (1, -1).
case (bool) – Indicates whether the output value should be capitalized.

Returns

The re-formatted value as indicated by ‘format’.

Return type

str, int

pdm_utils.functions.basic.select_option(prompt, valid_response_set)¶

Select an option from a set of options.

Parameters

prompt (str) – Message to display before displaying option.
valid_response_set (set) – Set of valid options to choose.

Returns

The selected option.

Return type

str, int

pdm_utils.functions.basic.set_path(path, kind=None, expect=True)¶

Confirm validity of path argument.

Parameters

path (Path) – The path to be checked.
kind (str) – (“file”, “dir”), corresponding with paths to be checked as either files or directories.
expect (bool) – Indicates if the path is expected to be of the indicated kind.

Returns

Absolute path if valid, otherwise sys.exit is called.

Return type

Path

pdm_utils.functions.basic.sort_histogram(histogram, descending=True)¶

Sorts a dictionary by its values and returns the sorted histogram.

Parameters: histogram (dict) – Dictionary containing keys whose values contain counts.
Returns: An ordered dict of items from the histogram, sorted by value.
Return type: OrderedDict

pdm_utils.functions.basic.sort_histogram_keys(histogram, descending=True)¶

Sorts a dictionary by its values and returns the sorted keys.

Parameters: histogram (dict) – Dictionary containing keys whose values contain counts.
Returns: A list of keys from the histogram, sorted by value.
Return type: list
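
Both sorting functions can be pictured with one value-based sort. The sketch below is an illustration only; the function names are hypothetical, and the handling of tied counts (original insertion order is kept) is an assumption that follows from Python's stable sort.

```python
from collections import OrderedDict


def sort_histogram_sketch(histogram, descending=True):
    """Hypothetical stand-in: OrderedDict of items ordered by count."""
    items = sorted(histogram.items(), key=lambda item: item[1], reverse=descending)
    return OrderedDict(items)


def sort_histogram_keys_sketch(histogram, descending=True):
    """Hypothetical stand-in: only the keys, ordered by count."""
    return list(sort_histogram_sketch(histogram, descending))


counts = {"A": 2, "B": 5, "C": 1}
print(sort_histogram_keys_sketch(counts))                    # ['B', 'A', 'C']
print(sort_histogram_keys_sketch(counts, descending=False))  # ['C', 'A', 'B']
```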

pdm_utils.functions.basic.split_string(string)¶

Split a string based on alphanumeric characters.

Iterates through a string, identifies the first position at which the character is numeric (can be converted to a float), and creates two strings at this position.

Parameters: string (str) – The value to be split.
Returns: tuple (left, right) WHERE left(str) is the left portion of the input value prior to the first numeric character and only contains alphabetic characters (or will be ‘’). right(str) is the right portion of the input value after the first numeric character and only contains numeric characters (or will be ‘’).
Return type: tuple
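
The alphabetic/numeric split can be approximated with a regular expression. This sketch makes two assumptions that are not stated above: the numeric portion consists of digits only, and a value that cannot be split cleanly (such as ‘A1B’) yields two empty strings. The name `split_alpha_numeric` is hypothetical.

```python
import re


def split_alpha_numeric(string):
    """Hypothetical stand-in: letters on the left, digits on the right."""
    match = re.fullmatch(r"([A-Za-z]*)([0-9]*)", string)
    if match is None:
        # Assumption: mixed or unexpected content gives ('', '').
        return "", ""
    return match.group(1), match.group(2)


print(split_alpha_numeric("A15"))  # ('A', '15')
print(split_alpha_numeric("A"))    # ('A', '')
```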

pdm_utils.functions.basic.trim_characters(string)¶

Remove leading and trailing generic characters from a string.

Parameters: string (str) – Value that will be trimmed. Characters that will be removed include: ‘.’, ‘,’, ‘;’, ‘-’, ‘_’.
Returns: Edited value.
Return type: str
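
Trimming the listed characters corresponds to `str.strip` with an explicit character set. The sketch below assumes only the five listed characters are removed (whitespace is left alone); `trim_characters_sketch` is a hypothetical name.

```python
def trim_characters_sketch(string):
    """Hypothetical stand-in: strip '.', ',', ';', '-', '_' from both ends."""
    return string.strip(".,;-_")


print(trim_characters_sketch(".,-terminase_;"))  # terminase
print(trim_characters_sketch("-major_tail-"))    # major_tail
```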

pdm_utils.functions.basic.truncate_value(value, length, suffix)¶

Truncate a string.

Parameters

value (str) – String that should be truncated.
length (int) – Final length of truncated string.
suffix (str) – String that should be appended to truncated string.

Returns

The truncated string.

Return type

str

pdm_utils.functions.basic.verify_path(filepath, kind=None)¶

Verifies that a given path exists.

Parameters

filepath (str) – Full path to the desired file/directory.
kind (str) – (“file”, “dir”), corresponding with paths to be checked as either files or directories.

Returns

True if path is verified, False otherwise.

Return type

bool

pdm_utils.functions.basic.verify_path2(path, kind=None, expect=True)¶

Verifies that a given path exists.

Parameters

path (Path) – The path to be checked.
kind (str) – (“file”, “dir”), corresponding with paths to be checked as either files or directories.
expect (bool) – Indicates if the path is expected to be of the indicated kind.

Returns

tuple (result, message) WHERE result(bool) indicates if the expectation was satisfied. message(str) is a description of the result.

Return type

tuple