basic¶
Miscellaneous basic functions. To prevent circular imports, these functions should not require importing other modules in this package.
- pdm_utils.functions.basic.ask_yes_no(prompt='', response_attempt=1)¶
Function to get the user’s yes/no response to a question.
Accepts variations of yes/y, true/t, no/n, false/f, exit/quit/q.
- Parameters
prompt (str) – the question to ask the user.
response_attempt (int) – The number of attempts allowed before the function exits. This prevents the script from getting stuck in a loop.
- Returns
The default is False (e.g. user hits Enter without typing anything else), but variations of yes or true responses will return True instead. If the response is ‘exit’ or ‘quit’, the loop is exited and None is returned.
- Return type
bool, None
- pdm_utils.functions.basic.check_empty(value, lower=True)¶
Checks if the value represents a null value.
- Parameters
value (misc.) – Value to be checked against the empty set.
lower (bool) – Indicates whether the input value should be lowercased prior to checking.
- Returns
Indicates whether the value is present in the empty set.
- Return type
bool
- pdm_utils.functions.basic.check_value_expected_in_set(value, set1, expect=True)¶
Check if a value is present within a set and if it is expected.
- Parameters
value (misc.) – The value to be checked.
set1 (set) – The reference set of values.
expect (bool) – Indicates if ‘value’ is expected to be present in ‘set1’.
- Returns
The result of the evaluation.
- Return type
bool
- pdm_utils.functions.basic.check_value_in_two_sets(value, set1, set2)¶
Check if a value is present within two sets.
- Parameters
value (misc.) – The value to be checked.
set1 (set) – The first reference set of values.
set2 (set) – The second reference set of values.
- Returns
The result of the evaluation, indicating whether the value is present within:
only the ‘first’ set
only the ‘second’ set
‘both’ sets
‘neither’ set
- Return type
str
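The four outcomes can be illustrated with a minimal sketch. The name `classify_membership` is hypothetical and only mirrors the documented return values; it is not the package's implementation.

```python
def classify_membership(value, set1, set2):
    """Hypothetical stand-in: report which of two sets contains a value."""
    in_first = value in set1
    in_second = value in set2
    if in_first and in_second:
        return "both"
    if in_first:
        return "first"
    if in_second:
        return "second"
    return "neither"


print(classify_membership("A", {"A", "B"}, {"B", "C"}))  # first
print(classify_membership("B", {"A", "B"}, {"B", "C"}))  # both
```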
- pdm_utils.functions.basic.choose_from_list(options)¶
Iterate through a list of values and choose a value.
- Parameters
options (list) – List of options to choose from.
- Returns
The user-selected option, or None.
- Return type
option or None
- pdm_utils.functions.basic.choose_most_common(string, values)¶
Identify most common occurrence of several values in a string.
- Parameters
string (str) – String to search.
values (list) – List of string characters. The order in the list indicates preference, in the case of a tie.
- Returns
Value from values that occurs most.
- Return type
str
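One way to picture the tie-breaking rule is the sketch below. The name `most_common_sketch` is hypothetical, and counting with `str.count` is an assumption; the package may count occurrences differently.

```python
def most_common_sketch(string, values):
    """Hypothetical stand-in: pick the value that occurs most in string.

    Earlier entries in values win ties, because a later value only
    replaces the current best when its count is strictly greater.
    """
    best_value = None
    best_count = -1
    for value in values:
        count = string.count(value)
        if count > best_count:
            best_value = value
            best_count = count
    return best_value


print(most_common_sketch("FFRRF", ["F", "R"]))  # F
print(most_common_sketch("FR", ["R", "F"]))     # R (tie, first listed wins)
```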
- pdm_utils.functions.basic.clear_screen()¶
Brings the command line to the top of the screen.
- pdm_utils.functions.basic.compare_cluster_subcluster(cluster, subcluster)¶
Check if a cluster and subcluster designation are compatible.
- Parameters
cluster (str) – The cluster value to be compared. ‘Singleton’ and ‘UNK’ are lowercased.
subcluster (str) – The subcluster value to be compared.
- Returns
The result of the evaluation, indicating whether the two values are compatible.
- Return type
bool
- pdm_utils.functions.basic.compare_sets(set1, set2)¶
Compute the intersection and differences between two sets.
- Parameters
set1 (set) – The first input set.
set2 (set) – The second input set.
- Returns
tuple (set_intersection, set1_diff, set2_diff) WHERE
set_intersection(set) is the set of shared values.
set1_diff(set) is the set of values unique to the first set.
set2_diff(set) is the set of values unique to the second set.
- Return type
tuple
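The three returned sets correspond to standard Python set operations. A minimal sketch, using the hypothetical name `compare_sets_sketch` rather than the package function:

```python
def compare_sets_sketch(set1, set2):
    """Hypothetical stand-in: intersection plus the two one-sided differences."""
    set_intersection = set1 & set2
    set1_diff = set1 - set2
    set2_diff = set2 - set1
    return set_intersection, set1_diff, set2_diff


shared, only_first, only_second = compare_sets_sketch({1, 2, 3}, {3, 4})
print(shared, only_first, only_second)
```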
- pdm_utils.functions.basic.convert_empty(input_value, format, upper=False)¶
Converts common null value formats.
- Parameters
input_value (str, int, datetime) – Value to be re-formatted.
format (str) – Indicates how the value should be edited. Valid format types include:
‘empty_string’ = ‘’
‘none_string’ = ‘none’
‘null_string’ = ‘null’
‘none_object’ = None
‘na_long’ = ‘not applicable’
‘na_short’ = ‘na’
‘n/a’ = ‘n/a’
‘zero_string’ = ‘0’
‘zero_num’ = 0
‘empty_datetime_obj’ = datetime object with arbitrary date, ‘1/1/0001’
upper (bool) – Indicates whether the output value should be uppercased.
- Returns
The re-formatted value as indicated by ‘format’.
- Return type
str, int, datetime
- pdm_utils.functions.basic.convert_list_to_dict(data_list, key)¶
Convert a list of dictionaries to a dictionary of dictionaries.
- Parameters
data_list (list) – List of dictionaries.
key (str) – key in each dictionary to become the returned dictionary key.
- Returns
Dictionary of all dictionaries. Returns an empty dictionary if all intended keys are not unique.
- Return type
dict
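The uniqueness rule can be sketched as follows. The name `list_to_dict_sketch` is hypothetical, and the sketch assumes every dictionary in the list contains the requested key.

```python
def list_to_dict_sketch(data_list, key):
    """Hypothetical stand-in: index a list of dicts by one of their keys.

    Returns an empty dict as soon as a duplicate key value is seen.
    """
    result = {}
    for item in data_list:
        new_key = item[key]  # assumes the key is present in every dict
        if new_key in result:
            return {}
        result[new_key] = item
    return result


rows = [{"id": "a", "n": 1}, {"id": "b", "n": 2}]
print(list_to_dict_sketch(rows, "id"))
```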
- pdm_utils.functions.basic.convert_to_decoded(values)¶
Converts a list of utf-8 encoded byte values to decoded strings.
- Parameters
values (list[bytes]) – Byte values from MySQL queries to be decoded.
- Returns
List of utf-8 decoded values.
- Return type
list[str]
- pdm_utils.functions.basic.convert_to_encoded(values)¶
Converts a list of strings to utf-8 encoded values.
- Parameters
values (list[str]) – Strings for a MySQL query to be encoded.
- Returns
List of utf-8 encoded values.
- Return type
list[bytes]
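Encoding and decoding are inverse operations, which the following sketch illustrates with the standard `str.encode` and `bytes.decode` methods. The names `encode_sketch` and `decode_sketch` are hypothetical stand-ins, not the package functions.

```python
def encode_sketch(values):
    """Hypothetical stand-in: str -> utf-8 bytes for each list item."""
    return [value.encode("utf-8") for value in values]


def decode_sketch(values):
    """Hypothetical stand-in: utf-8 bytes -> str for each list item."""
    return [value.decode("utf-8") for value in values]


encoded = encode_sketch(["Trixie", "L5"])
print(encoded)
print(decode_sketch(encoded))
```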
- pdm_utils.functions.basic.create_indices(input_list, batch_size)¶
Create list of start and stop indices to split a list into batches.
- Parameters
input_list (list) – List from which to generate batch indices.
batch_size (int) – Size of each batch.
- Returns
List of 2-element tuples (start index, stop index).
- Return type
list
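A rough sketch of batch index generation is shown below. It assumes the stop index is exclusive, so that each tuple can be used directly in a slice; the documentation above does not state this, and the package may define the stop index differently. The name `batch_indices_sketch` is hypothetical.

```python
def batch_indices_sketch(input_list, batch_size):
    """Hypothetical stand-in: (start, stop) pairs, stop exclusive (assumed)."""
    total = len(input_list)
    return [(start, min(start + batch_size, total))
            for start in range(0, total, batch_size)]


items = ["a", "b", "c", "d", "e"]
for start, stop in batch_indices_sketch(items, 2):
    print(items[start:stop])
```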
- pdm_utils.functions.basic.edit_suffix(value, option, suffix='_Draft')¶
Adds or removes the indicated suffix to an input value.
- Parameters
value (str) – Value that will be edited.
option (str) – Indicates what to do with the value and suffix (‘add’, ‘remove’).
suffix (str) – The suffix that will be added or removed.
- Returns
The edited value. The suffix is not added if the input value already has the suffix.
- Return type
str
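The add/remove behavior might look like the sketch below. The name `edit_suffix_sketch` is hypothetical, and the case-sensitive comparison is an assumption; the package may compare the suffix differently.

```python
def edit_suffix_sketch(value, option, suffix="_Draft"):
    """Hypothetical stand-in: add or remove a suffix (case-sensitive, assumed)."""
    has_suffix = bool(suffix) and value.endswith(suffix)
    if option == "add" and not has_suffix:
        return value + suffix
    if option == "remove" and has_suffix:
        return value[:-len(suffix)]
    return value


print(edit_suffix_sketch("Trixie", "add"))           # Trixie_Draft
print(edit_suffix_sketch("Trixie_Draft", "add"))     # Trixie_Draft
print(edit_suffix_sketch("Trixie_Draft", "remove"))  # Trixie
```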
- pdm_utils.functions.basic.expand_path(input_path)¶
Convert a non-absolute path into an absolute path.
- Parameters
input_path (str) – The path to be expanded.
- Returns
The expanded path.
- Return type
str
- pdm_utils.functions.basic.find_expression(expression, list_of_items)¶
Counts the number of items with matches to a regular expression.
- Parameters
expression (re) – Regular expression object
list_of_items (list) – List of items that will be searched with the regular expression.
- Returns
Number of times the regular expression was identified in the list.
- Return type
int
- pdm_utils.functions.basic.get_user_pwd(user_prompt='Username: ', pwd_prompt='Password: ')¶
Get username and password.
- Parameters
user_prompt (str) – Displayed description when prompted for username.
pwd_prompt (str) – Displayed description when prompted for password.
- Returns
tuple (username, password) WHERE
username(str) is the user-supplied username.
password(str) is the user-supplied password.
- Return type
tuple
- pdm_utils.functions.basic.get_values_from_dict_list(list_of_dicts)¶
Convert a list of dictionaries to a set of the dictionary values.
- Parameters
list_of_dicts (list) – List of dictionaries.
- Returns
Set of values from all dictionaries in the list.
- Return type
set
- pdm_utils.functions.basic.get_values_from_tuple_list(list_of_tuples)¶
Convert a list of tuples to a set of the tuple values.
- Parameters
list_of_tuples (list) – List of tuples.
- Returns
Set of values from all tuples in the list.
- Return type
set
- pdm_utils.functions.basic.identify_contents(path_to_folder, kind=None, ignore_set={})¶
Create a list of filenames and/or folders from an indicated directory.
- Parameters
path_to_folder (Path) – A valid directory path.
kind (str) – (“file”, “dir”), corresponding with paths to be checked as either files or directories.
ignore_set (set) – A set of strings representing file or folder names to ignore.
- Returns
List of valid contents in the directory.
- Return type
list
- pdm_utils.functions.basic.identify_nested_items(complete_list)¶
Identify nested and non-nested two-element tuples in a list.
- Parameters
complete_list (list) – List of tuples that will be evaluated.
- Returns
tuple (not_nested_set, nested_set) WHERE
not_nested_set(set) is a set of non-nested tuples.
nested_set(set) is a set of nested tuples.
- Return type
tuple
- pdm_utils.functions.basic.identify_one_list_duplicates(item_list)¶
Identify duplicate items within a list.
- Parameters
item_list (list) – The input list to be checked.
- Returns
The set of non-unique/duplicated items.
- Return type
set
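Duplicate detection within a single list can be sketched with two sets. The name `duplicates_sketch` is hypothetical and the approach is one of several possible ones, not necessarily the one used by the package.

```python
def duplicates_sketch(item_list):
    """Hypothetical stand-in: return items that appear more than once."""
    seen = set()
    duplicates = set()
    for item in item_list:
        if item in seen:
            duplicates.add(item)
        else:
            seen.add(item)
    return duplicates


print(duplicates_sketch(["a", "b", "a", "c", "b", "a"]))
```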
- pdm_utils.functions.basic.identify_two_list_duplicates(item1_list, item2_list)¶
Identify duplicate items between two lists.
- Parameters
item1_list (list) – The first input list to be checked.
item2_list (list) – The second input list to be checked.
- Returns
The set of non-unique/duplicated items between the two lists (but not duplicate items within each list).
- Return type
set
- pdm_utils.functions.basic.identify_unique_items(complete_list)¶
Identify unique and non-unique items in a list.
- Parameters
complete_list (list) – List of items that will be evaluated.
- Returns
tuple (unique_set, duplicate_set) WHERE
unique_set(set) is a set of all unique/non-duplicated items.
duplicate_set(set) is a set of non-unique/duplicated items.
Non-informative/generic data is removed.
- Return type
tuple
- pdm_utils.functions.basic.increment_histogram(data, histogram)¶
Increments a dictionary histogram based on given data.
- Parameters
data (list) – Data to be used to index or create new keys in the histogram.
histogram (dict) – Dictionary containing keys whose values contain counts.
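Since no return value is documented, the sketch below assumes the histogram is updated in place. The name `increment_sketch` is hypothetical.

```python
def increment_sketch(data, histogram):
    """Hypothetical stand-in: count each item of data into histogram in place."""
    for item in data:
        # Create the key with a count of 0 if it is not yet present.
        histogram[item] = histogram.get(item, 0) + 1


counts = {"A": 1}
increment_sketch(["A", "B", "A"], counts)
print(counts)  # {'A': 3, 'B': 1}
```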
- pdm_utils.functions.basic.invert_dictionary(dictionary)¶
Inverts a dictionary, where the values and keys are swapped.
- Parameters
dictionary (dict) – A dictionary to be inverted.
- Returns
Returns an inverted dictionary of the given dictionary.
- Return type
dict
- pdm_utils.functions.basic.is_float(string)¶
Check if string can be converted to float.
- pdm_utils.functions.basic.join_strings(input_list, delimiter=' ')¶
Join a list of values into a single string.
- Parameters
input_list (list) – List of values to join.
delimiter (str) – Delimiter used between values.
- Returns
Concatenated values, excluding all None and ‘’ values.
- Return type
str
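The exclusion of None and empty-string values can be sketched as follows. The name `join_sketch` is hypothetical, and the sketch assumes the remaining values are already strings.

```python
def join_sketch(input_list, delimiter=" "):
    """Hypothetical stand-in: join values, skipping None and '' entries."""
    kept = [value for value in input_list if value is not None and value != ""]
    return delimiter.join(kept)


print(join_sketch(["major", None, "", "capsid"]))       # major capsid
print(join_sketch(["a", "", "b"], delimiter=","))       # a,b
```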
- pdm_utils.functions.basic.lower_case(value)¶
Return the value lowercased if it is within a specific set of values.
- Parameters
value (str) – The value to be checked.
- Returns
The lowercased value if it is equivalent to ‘none’, ‘retrieve’, or ‘retain’.
- Return type
str
- pdm_utils.functions.basic.make_new_dir(output_dir, new_dir, attempt=1, mkdir=True)¶
Make a new directory.
Checks to verify the new directory name is valid and does not already exist. If it already exists, it attempts to extend the name with an integer suffix.
- Parameters
output_dir (Path) – Full path to the directory where the new directory will be created.
new_dir (Path) – Name of the new directory to be created.
attempt (int) – Number of attempts to create the directory.
- Returns
If successful, the full path of the created directory. If unsuccessful, None.
- Return type
Path, None
- pdm_utils.functions.basic.make_new_file(output_dir, new_file, ext, attempt=1)¶
Make a new file.
Checks to verify the new file name is valid and does not already exist. If it already exists, it attempts to extend the name with an integer suffix.
- Parameters
output_dir (Path) – Full path to the directory where the new file will be created.
new_file (Path) – Name of the new file to be created.
ext (str) – Name of the file extension to be used.
attempt (int) – Number of attempts to create the file.
- Returns
If successful, the full path of the created file. If unsuccessful, None.
- Return type
Path, None
- pdm_utils.functions.basic.match_items(list1, list2)¶
Match values of two lists and return several results.
- Parameters
list1 (list) – The first input list.
list2 (list) – The second input list.
- Returns
tuple (matched_unique_items, set1_unmatched_unique_items, set2_unmatched_unique_items, set1_duplicate_items, set2_duplicate_items) WHERE
matched_unique_items(set) is the set of matched unique values.
set1_unmatched_unique_items(set) is the set of unmatched unique values from the first list.
set2_unmatched_unique_items(set) is the set of unmatched unique values from the second list.
set1_duplicate_items(set) is the set of duplicate values from the first list.
set2_duplicate_items(set) is the set of duplicate values from the second list.
- Return type
tuple
- pdm_utils.functions.basic.merge_set_dicts(dict1, dict2)¶
Merge two dictionaries of sets.
- Parameters
dict1 (dict) – First dictionary of sets.
dict2 (dict) – Second dictionary of sets.
- Returns
Merged dictionary containing all keys from both dictionaries, and for each shared key the value is a set of merged values.
- Return type
dict
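A minimal sketch of merging two dictionaries of sets is given below. The name `merge_sketch` is hypothetical; the sketch copies the input sets so that neither input dictionary is modified, which is a design choice of the sketch rather than documented behavior.

```python
def merge_sketch(dict1, dict2):
    """Hypothetical stand-in: union the sets stored under each key."""
    merged = {key: set(values) for key, values in dict1.items()}
    for key, values in dict2.items():
        merged.setdefault(key, set()).update(values)
    return merged


first = {"A": {1, 2}, "B": {3}}
second = {"B": {4}, "C": {5}}
print(merge_sketch(first, second))
```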
- pdm_utils.functions.basic.parse_flag_file(flag_file)¶
Parse a file to an evaluation flag dictionary.
- Parameters
flag_file (str) – A two-column csv-formatted file WHERE column 1 is the evaluation flag and column 2 is ‘True’ or ‘False’.
- Returns
A dictionary WHERE
keys (str) are evaluation flags.
values (bool) indicate the flag setting.
Only flags that contain boolean values are returned.
- Return type
dict
- pdm_utils.functions.basic.parse_names_from_record_field(description)¶
Attempts to parse the phage/plasmid/prophage name and host genus from a given string.
- Parameters
description (str) – The input string to be parsed.
- Returns
name, host_genus
- pdm_utils.functions.basic.partition_list(data_list, size)¶
Chunks list into a list of lists with the given size.
- Parameters
data_list (list) – List to be split into equal-sized lists.
size (int) – Length of the resulting list chunks.
- Returns
Returns list of lists with length of the given size.
- Return type
list[list]
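Chunking a list by slicing can be sketched as below. The name `partition_sketch` is hypothetical. In this sketch the final chunk is shorter when the list length is not a multiple of the size; whether the package behaves the same way is an assumption.

```python
def partition_sketch(data_list, size):
    """Hypothetical stand-in: split a list into consecutive chunks of size."""
    return [data_list[i:i + size] for i in range(0, len(data_list), size)]


print(partition_sketch([1, 2, 3, 4, 5], 2))  # [[1, 2], [3, 4], [5]]
```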
- pdm_utils.functions.basic.prepare_filepath(folder_path, file_name, folder_name=None)¶
Prepare path to new file.
- Parameters
folder_path (Path) – Path to the directory to contain the file.
file_name (str) – Name of the file.
folder_name (Path) – Name of sub-directory to create.
- Returns
Path to file in directory.
- Return type
Path
- pdm_utils.functions.basic.reformat_coordinates(start, stop, current, new)¶
Converts common coordinate formats.
The types of coordinate formats include:
‘0_half_open’:
0-based half-open intervals, the common format for BAM files and the UCSC Browser database. This format seems to be more efficient when performing genomics computations.
‘1_closed’:
1-based closed intervals, the common format for the MySQL database, the UCSC Browser, the Ensembl genomics database, VCF files, and GFF files. This format seems to be more intuitive and is used for visualization.
The function assumes coordinates reflect the start and stop boundaries (where the start coordinate is smaller than the stop coordinate), instead of transcription start and stop coordinates.
- Parameters
start (int) – Start coordinate
stop (int) – Stop coordinate
current (str) – Indicates the indexing format of the input coordinates.
new (str) – Indicates the indexing format of the output coordinates.
- Returns
The re-formatted start and stop coordinates.
- Return type
int
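The arithmetic behind the two formats is small: a 0-based half-open interval [start, stop) covers the same positions as the 1-based closed interval [start + 1, stop], so only the start coordinate changes. The sketch below illustrates this under the name `convert_coordinates_sketch`, which is hypothetical; raising `ValueError` for an unrecognized format is also an assumption of the sketch.

```python
VALID_FORMATS = {"0_half_open", "1_closed"}


def convert_coordinates_sketch(start, stop, current, new):
    """Hypothetical stand-in: convert between the two coordinate formats."""
    if current not in VALID_FORMATS or new not in VALID_FORMATS:
        # Error handling here is an assumption, not documented behavior.
        raise ValueError("format must be '0_half_open' or '1_closed'")
    if current == new:
        return start, stop
    if current == "0_half_open":
        return start + 1, stop  # 0-based half-open -> 1-based closed
    return start - 1, stop      # 1-based closed -> 0-based half-open


print(convert_coordinates_sketch(0, 10, "0_half_open", "1_closed"))  # (1, 10)
print(convert_coordinates_sketch(1, 10, "1_closed", "0_half_open"))  # (0, 10)
```

Both intervals in the usage lines describe the same ten positions.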
- pdm_utils.functions.basic.reformat_description(raw_description)¶
Reformat a gene description.
- Parameters
raw_description (str) – Input value to be reformatted.
- Returns
tuple (description, processed_description) WHERE
description(str) is the original value stripped of leading and trailing whitespace.
processed_description(str) is the reformatted value, in which non-informative/generic data is removed.
- Return type
tuple
- pdm_utils.functions.basic.reformat_strand(input_value, format, case=False)¶
Converts common strand orientation formats.
- Parameters
input_value (str, int) – Value that will be edited.
format (str) – Indicates how the value should be edited. Valid format types include:
‘fr_long’ (‘forward’, ‘reverse’)
‘fr_short’ (‘f’, ‘r’)
‘fr_abbrev1’ (‘for’, ‘rev’)
‘fr_abbrev2’ (‘fwd’, ‘rev’)
‘tb_long’ (‘top’, ‘bottom’)
‘tb_short’ (‘t’, ‘b’)
‘wc_long’ (‘watson’, ‘crick’)
‘wc_short’ (‘w’, ‘c’)
‘operator’ (‘+’, ‘-’)
‘numeric’ (1, -1)
case (bool) – Indicates whether the output value should be capitalized.
- Returns
The re-formatted value as indicated by ‘format’.
- Return type
str, int
- pdm_utils.functions.basic.select_option(prompt, valid_response_set)¶
Select an option from a set of options.
- Parameters
prompt (str) – Message to display before displaying option.
valid_response_set (set) – Set of valid options to choose.
- Returns
The selected option.
- Return type
str, int
- pdm_utils.functions.basic.set_path(path, kind=None, expect=True)¶
Confirm validity of path argument.
- Parameters
path (Path) – path
kind (str) – (“file”, “dir”), corresponding with paths to be checked as either files or directories.
expect (bool) – Indicates if the path is expected to be of the indicated kind.
- Returns
Absolute path if valid, otherwise sys.exit is called.
- Return type
Path
- pdm_utils.functions.basic.sort_histogram(histogram, descending=True)¶
Sorts a dictionary by its values and returns the sorted histogram.
- Parameters
histogram (dict) – Dictionary containing keys whose values contain counts.
- Returns
An ordered dict of the items from the histogram, sorted by value.
- Return type
OrderedDict
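Sorting a histogram by count can be sketched with `sorted` and `collections.OrderedDict`. The name `sort_histogram_sketch` is hypothetical and the sketch is an illustration, not the package's implementation.

```python
from collections import OrderedDict


def sort_histogram_sketch(histogram, descending=True):
    """Hypothetical stand-in: order histogram items by their counts."""
    items = sorted(histogram.items(),
                   key=lambda pair: pair[1],
                   reverse=descending)
    return OrderedDict(items)


counts = {"A": 1, "B": 3, "C": 2}
print(list(sort_histogram_sketch(counts).items()))
# [('B', 3), ('C', 2), ('A', 1)]
```

Taking `list(...)` of the result's keys gives the kind of key-only ordering that the key-sorting variant describes.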
- pdm_utils.functions.basic.sort_histogram_keys(histogram, descending=True)¶
Sorts a dictionary by its values and returns a list of its keys in sorted order.
- Parameters
histogram (dict) – Dictionary containing keys whose values contain counts.
- Returns
A list of the keys from the histogram, sorted by value.
- Return type
list
- pdm_utils.functions.basic.split_string(string)¶
Split a string based on alphanumeric characters.
Iterates through a string, identifies the first position at which the character is numeric, and splits the string into two strings at this position.
- Parameters
string (str) – The value to be split.
- Returns
tuple (left, right) WHERE
left(str) is the left portion of the input value prior to the first numeric character and only contains alphabetic characters (or will be ‘’).
right(str) is the right portion of the input value after the first numeric character and only contains numeric characters (or will be ‘’).
- Return type
tuple
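The splitting rule can be approximated with the sketch below. The name `split_sketch` is hypothetical. The sketch treats only digits as numeric and returns two empty strings when no clean alphabetic/numeric split exists; both choices are assumptions, and the package may handle decimal values or mixed strings differently.

```python
def split_sketch(string):
    """Hypothetical stand-in: split 'AB12' into ('AB', '12')."""
    if string.isalpha():
        return string, ""
    if string.isdigit():
        return "", string
    # Look for a position where the left side is purely alphabetic
    # and the right side is purely numeric.
    for position in range(1, len(string)):
        left, right = string[:position], string[position:]
        if left.isalpha() and right.isdigit():
            return left, right
    return "", ""


print(split_sketch("AB12"))  # ('AB', '12')
print(split_sketch("AB"))    # ('AB', '')
```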
- pdm_utils.functions.basic.trim_characters(string)¶
Remove leading and trailing generic characters from a string.
- Parameters
string (str) – Value that will be trimmed. Characters that will be removed include: ‘.’, ‘,’, ‘;’, ‘-’, ‘_’.
- Returns
Edited value.
- Return type
str
- pdm_utils.functions.basic.truncate_value(value, length, suffix)¶
Truncate a string.
- Parameters
value (str) – String that should be truncated.
length (int) – Final length of truncated string.
suffix (str) – String that should be appended to truncated string.
- Returns
the truncated string
- Return type
str
- pdm_utils.functions.basic.verify_path(filepath, kind=None)¶
Verifies that a given path exists.
- Parameters
filepath (str) – full path to the desired file/directory.
kind (str) – (“file”, “dir”), corresponding with paths to be checked as either files or directories.
- Returns
True if path is verified, False otherwise.
- Return type
bool
- pdm_utils.functions.basic.verify_path2(path, kind=None, expect=True)¶
Verifies that a given path exists.
- Parameters
path (Path) – path
kind (str) – (“file”, “dir”), corresponding with paths to be checked as either files or directories.
expect (bool) – Indicates if the path is expected to be of the indicated kind.
- Returns
tuple (result, message) WHERE
result(bool) indicates if the expectation was satisfied.
message(str) is a description of the result.
- Return type
tuple