pdm_utils.functions package¶
Submodules¶
pdm_utils.functions.annotation module¶
Functions to retrieve phage genome annotation data.
- pdm_utils.functions.annotation.get_adjacent_genes(alchemist, gene)¶
- pdm_utils.functions.annotation.get_annotations_from_genes(alchemist, geneids)¶
- pdm_utils.functions.annotation.get_count_adjacent_annotations_to_pham(alchemist, pham, incounts=None)¶
- pdm_utils.functions.annotation.get_count_adjacent_phams_to_pham(alchemist, pham, incounts=None)¶
- pdm_utils.functions.annotation.get_count_annotations_in_genes(alchemist, geneids, incounts=None)¶
- pdm_utils.functions.annotation.get_count_annotations_in_pham(alchemist, pham, incounts=None)¶
- pdm_utils.functions.annotation.get_count_phams_in_genes(alchemist, geneids, incounts=None)¶
- pdm_utils.functions.annotation.get_distinct_adjacent_phams(alchemist, pham)¶
- pdm_utils.functions.annotation.get_distinct_annotations_from_genes(alchemist, geneids)¶
- pdm_utils.functions.annotation.get_distinct_phams_from_genes(alchemist, geneids)¶
- pdm_utils.functions.annotation.get_genes_adjacent_to_pham(alchemist, pham)¶
- pdm_utils.functions.annotation.get_genes_from_pham(alchemist, pham)¶
- pdm_utils.functions.annotation.get_phams_from_genes(alchemist, geneids)¶
- pdm_utils.functions.annotation.get_relative_gene(alchemist, geneid, pos)¶
pdm_utils.functions.basic module¶
Misc. base/simple functions. These should not require import of other modules in this package to prevent circular imports.
- pdm_utils.functions.basic.ask_yes_no(prompt='', response_attempt=1)¶
Function to get the user’s yes/no response to a question.
Accepts variations of yes/y, true/t, no/n, false/f, exit/quit/q.
- Parameters
prompt (str) – the question to ask the user.
response_attempt (int) – The number of attempts allowed before the function exits. This prevents the script from getting stuck in a loop.
- Returns
The default is False (e.g. user hits Enter without typing anything else), but variations of yes or true responses will return True instead. If the response is ‘exit’ or ‘quit’, the loop is exited and None is returned.
- Return type
bool, None
- pdm_utils.functions.basic.check_empty(value, lower=True)¶
Checks if the value represents a null value.
- Parameters
value (misc.) – Value to be checked against the empty set.
lower (bool) – Indicates whether the input value should be lowercased prior to checking.
- Returns
Indicates whether the value is present in the empty set.
- Return type
bool
- pdm_utils.functions.basic.check_value_expected_in_set(value, set1, expect=True)¶
Check if a value is present within a set and if it is expected.
- Parameters
value (misc.) – The value to be checked.
set1 (set) – The reference set of values.
expect (bool) – Indicates if ‘value’ is expected to be present in ‘set1’.
- Returns
The result of the evaluation.
- Return type
bool
- pdm_utils.functions.basic.check_value_in_two_sets(value, set1, set2)¶
Check if a value is present within two sets.
- Parameters
value (misc.) – The value to be checked.
set1 (set) – The first reference set of values.
set2 (set) – The second reference set of values.
- Returns
The result of the evaluation, indicating whether the value is present within:
only the ‘first’ set
only the ‘second’ set
’both’ sets
’neither’ set
- Return type
str
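The four possible results listed above can be illustrated with a minimal sketch. This is not the pdm_utils implementation; the function name `check_value_in_two_sets_sketch` is hypothetical, and only the four return strings are taken from the description.

```python
def check_value_in_two_sets_sketch(value, set1, set2):
    """Hypothetical stand-in: classify a value by its membership in two sets."""
    in_first = value in set1
    in_second = value in set2
    if in_first and in_second:
        return "both"
    if in_first:
        return "first"
    if in_second:
        return "second"
    return "neither"


# Example: 'A' occurs only in the first set, 'B' in both, 'Z' in neither.
results = [
    check_value_in_two_sets_sketch(v, {"A", "B"}, {"B", "C"})
    for v in ("A", "B", "C", "Z")
]
```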
- pdm_utils.functions.basic.choose_from_list(options)¶
Iterate through a list of values and choose a value.
- Parameters
options (list) – List of options to choose from.
- Returns
The user-selected option, or None.
- Return type
option or None
- pdm_utils.functions.basic.choose_most_common(string, values)¶
Identify most common occurrence of several values in a string.
- Parameters
string (str) – String to search.
values (list) – List of string characters. The order in the list indicates preference, in the case of a tie.
- Returns
Value from values that occurs most.
- Return type
str
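The tie-breaking rule (earlier entries in values win) could be sketched as follows. This is an illustration under assumptions, not the library's code: the name `choose_most_common_sketch` is hypothetical, and returning an empty string for an empty values list is a choice made only for this example.

```python
def choose_most_common_sketch(string, values):
    """Hypothetical stand-in: return the value that occurs most often in string."""
    best_value = ""   # assumption: empty string when no values are supplied
    best_count = -1
    for value in values:
        count = string.count(value)
        # A strictly greater count is required to replace the current best,
        # so the earlier value in the list is kept when counts are tied.
        if count > best_count:
            best_value, best_count = value, count
    return best_value


most_common = choose_most_common_sketch("FFRFR", ["F", "R"])
tie_winner = choose_most_common_sketch("FR", ["R", "F"])
```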
- pdm_utils.functions.basic.clear_screen()¶
Brings the command line to the top of the screen.
- pdm_utils.functions.basic.compare_cluster_subcluster(cluster, subcluster)¶
Check if a cluster and subcluster designation are compatible.
- Parameters
cluster (str) – The cluster value to be compared. ‘Singleton’ and ‘UNK’ are lowercased.
subcluster (str) – The subcluster value to be compared.
- Returns
The result of the evaluation, indicating whether the two values are compatible.
- Return type
bool
- pdm_utils.functions.basic.compare_sets(set1, set2)¶
Compute the intersection and differences between two sets.
- Parameters
set1 (set) – The first input set.
set2 (set) – The second input set.
- Returns
tuple (set_intersection, set1_diff, set2_diff) WHERE set_intersection(set) is the set of shared values. set1_diff(set) is the set of values unique to the first set. set2_diff(set) is the set of values unique to the second set.
- Return type
tuple
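The returned tuple maps directly onto Python's built-in set operators. A minimal sketch (the name `compare_sets_sketch` is hypothetical, and this is not necessarily how pdm_utils implements it):

```python
def compare_sets_sketch(set1, set2):
    """Hypothetical stand-in: intersection plus the two one-sided differences."""
    set_intersection = set1 & set2   # values shared by both sets
    set1_diff = set1 - set2          # values unique to the first set
    set2_diff = set2 - set1          # values unique to the second set
    return (set_intersection, set1_diff, set2_diff)


shared, only_first, only_second = compare_sets_sketch({1, 2, 3}, {3, 4})
```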
- pdm_utils.functions.basic.convert_empty(input_value, format, upper=False)¶
Converts common null value formats.
- Parameters
input_value (str, int, datetime) – Value to be re-formatted.
format (str) – Indicates how the value should be edited. Valid format types include:
‘empty_string’ = ‘’
‘none_string’ = ‘none’
‘null_string’ = ‘null’
‘none_object’ = None
‘na_long’ = ‘not applicable’
‘na_short’ = ‘na’
‘n/a’ = ‘n/a’
‘zero_string’ = ‘0’
‘zero_num’ = 0
‘empty_datetime_obj’ = datetime object with arbitrary date, ‘1/1/0001’
upper (bool) – Indicates whether the output value should be uppercased.
- Returns
The re-formatted value as indicated by ‘format’.
- Return type
str, int, datetime
- pdm_utils.functions.basic.convert_list_to_dict(data_list, key)¶
Convert list of dictionaries to a dictionary of dictionaries
- Parameters
data_list (list) – List of dictionaries.
key (str) – key in each dictionary to become the returned dictionary key.
- Returns
Dictionary of all dictionaries. Returns an empty dictionary if all intended keys are not unique.
- Return type
dict
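The "empty dictionary on non-unique keys" behaviour described above can be sketched like this. The function name `convert_list_to_dict_sketch` and the sample column names are hypothetical; the real function may differ in details such as how missing keys are handled.

```python
def convert_list_to_dict_sketch(data_list, key):
    """Hypothetical stand-in: index a list of dicts by the value stored under key."""
    result = {}
    for entry in data_list:
        new_key = entry[key]
        if new_key in result:
            # The intended keys are not unique, so signal failure with {}.
            return {}
        result[new_key] = entry
    return result


rows = [{"PhageID": "Trixie", "Cluster": "A"}, {"PhageID": "L5", "Cluster": "A"}]
by_phage = convert_list_to_dict_sketch(rows, "PhageID")
by_cluster = convert_list_to_dict_sketch(rows, "Cluster")  # 'A' is repeated
```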
- pdm_utils.functions.basic.convert_to_decoded(values)¶
Converts a list of utf-8 encoded byte values to decoded strings.
- Parameters
values (list[bytes]) – Byte values from MySQL queries to be decoded.
- Returns
List of utf-8 decoded values.
- Return type
list[str]
- pdm_utils.functions.basic.convert_to_encoded(values)¶
Converts a list of strings to utf-8 encoded values.
- Parameters
values (list[str]) – Strings for a MySQL query to be encoded.
- Returns
List of utf-8 encoded values.
- Return type
list[bytes]
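Judging by their parameter and return types, convert_to_encoded and convert_to_decoded are inverse operations. A minimal sketch of that round trip, using hypothetical `_sketch` names and plain `str.encode` / `bytes.decode` (an assumption about the approach, not the library's source):

```python
def convert_to_encoded_sketch(values):
    """Hypothetical stand-in: list[str] -> list[bytes], utf-8 encoded."""
    return [value.encode("utf-8") for value in values]


def convert_to_decoded_sketch(values):
    """Hypothetical stand-in: list[bytes] -> list[str], utf-8 decoded."""
    return [value.decode("utf-8") for value in values]


encoded = convert_to_encoded_sketch(["Trixie", "L5"])
decoded = convert_to_decoded_sketch(encoded)
```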
- pdm_utils.functions.basic.create_indices(input_list, batch_size)¶
Create list of start and stop indices to split a list into batches.
- Parameters
input_list (list) – List from which to generate batch indices.
batch_size (int) – Size of each batch.
- Returns
List of 2-element tuples (start index, stop index).
- Return type
list
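One way to picture the returned index pairs is the sketch below. It assumes the stop index is exclusive, so that each pair can be used directly in a slice; the documentation above does not say whether that is the case, and the name `create_indices_sketch` is hypothetical.

```python
def create_indices_sketch(input_list, batch_size):
    """Hypothetical stand-in: (start, stop) pairs covering input_list in batches."""
    indices = []
    for start in range(0, len(input_list), batch_size):
        # Assumption: stop is exclusive and capped at the list length.
        stop = min(start + batch_size, len(input_list))
        indices.append((start, stop))
    return indices


items = list("abcdefg")
index_pairs = create_indices_sketch(items, 3)
batches = [items[start:stop] for start, stop in index_pairs]
```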
- pdm_utils.functions.basic.edit_suffix(value, option, suffix='_Draft')¶
Adds or removes the indicated suffix to an input value.
- Parameters
value (str) – Value that will be edited.
option (str) – Indicates what to do with the value and suffix (‘add’, ‘remove’).
suffix (str) – The suffix that will be added or removed.
- Returns
The edited value. The suffix is not added if the input value already has the suffix.
- Return type
str
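The add/remove behaviour, including the rule that an existing suffix is not added twice, might look like the following sketch. The name `edit_suffix_sketch` is hypothetical, and case-sensitive suffix matching is an assumption made for this example.

```python
def edit_suffix_sketch(value, option, suffix="_Draft"):
    """Hypothetical stand-in: add or remove a suffix from a string."""
    has_suffix = bool(suffix) and value.endswith(suffix)  # assumption: case-sensitive
    if option == "add" and not has_suffix:
        return value + suffix
    if option == "remove" and has_suffix:
        return value[: -len(suffix)]
    # Any other combination leaves the value unchanged.
    return value


added = edit_suffix_sketch("Trixie", "add")
added_twice = edit_suffix_sketch(added, "add")
removed = edit_suffix_sketch(added, "remove")
```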
- pdm_utils.functions.basic.expand_path(input_path)¶
Convert a non-absolute path into an absolute path.
- Parameters
input_path (str) – The path to be expanded.
- Returns
The expanded path.
- Return type
str
- pdm_utils.functions.basic.find_expression(expression, list_of_items)¶
Counts the number of items with matches to a regular expression.
- Parameters
expression (re) – Regular expression object
list_of_items (list) – List of items that will be searched with the regular expression.
- Returns
Number of times the regular expression was identified in the list.
- Return type
int
- pdm_utils.functions.basic.get_user_pwd(user_prompt='Username: ', pwd_prompt='Password: ')¶
Get username and password.
- Parameters
user_prompt (str) – Displayed description when prompted for username.
pwd_prompt (str) – Displayed description when prompted for password.
- Returns
tuple (username, password) WHERE username(str) is the user-supplied username. password(str) is the user-supplied password.
- Return type
tuple
- pdm_utils.functions.basic.get_values_from_dict_list(list_of_dicts)¶
Convert a list of dictionaries to a set of the dictionary values.
- Parameters
list_of_dicts (list) – List of dictionaries.
- Returns
Set of values from all dictionaries in the list.
- Return type
set
- pdm_utils.functions.basic.get_values_from_tuple_list(list_of_tuples)¶
Convert a list of tuples to a set of the tuple values.
- Parameters
list_of_tuples (list) – List of tuples.
- Returns
Set of values from all tuples in the list.
- Return type
set
- pdm_utils.functions.basic.identify_contents(path_to_folder, kind=None, ignore_set={})¶
Create a list of filenames and/or folders from an indicated directory.
- Parameters
path_to_folder (Path) – A valid directory path.
kind (str) – (“file”, “dir”), corresponding with paths to be checked as either files or directories.
ignore_set (set) – A set of strings representing file or folder names to ignore.
- Returns
List of valid contents in the directory.
- Return type
list
- pdm_utils.functions.basic.identify_nested_items(complete_list)¶
Identify nested and non-nested two-element tuples in a list.
- Parameters
complete_list (list) – List of tuples that will be evaluated.
- Returns
tuple (not_nested_set, nested_set) WHERE not_nested_set(set) is a set of non-nested tuples. nested_set(set) is a set of nested tuples.
- Return type
tuple
- pdm_utils.functions.basic.identify_one_list_duplicates(item_list)¶
Identify duplicate items within a list.
- Parameters
item_list (list) – The input list to be checked.
- Returns
The set of non-unique/duplicated items.
- Return type
set
- pdm_utils.functions.basic.identify_two_list_duplicates(item1_list, item2_list)¶
Identify duplicate items between two lists.
- Parameters
item1_list (list) – The first input list to be checked.
item2_list (list) – The second input list to be checked.
- Returns
The set of non-unique/duplicated items between the two lists (but not duplicate items within each list).
- Return type
set
- pdm_utils.functions.basic.identify_unique_items(complete_list)¶
Identify unique and non-unique items in a list.
- Parameters
complete_list (list) – List of items that will be evaluated.
- Returns
tuple (unique_set, duplicate_set) WHERE unique_set(set) is a set of all unique/non-duplicated items. duplicate_set(set) is a set of non-unique/duplicated items. non-informative/generic data is removed.
- Return type
tuple
- pdm_utils.functions.basic.increment_histogram(data, histogram)¶
Increments a dictionary histogram based on given data.
- Parameters
data (list) – Data to be used to index or create new keys in the histogram.
histogram (dict) – Dictionary containing keys whose values contain counts.
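Since no return value is documented, the sketch below assumes the histogram is updated in place. The name `increment_histogram_sketch` is hypothetical.

```python
def increment_histogram_sketch(data, histogram):
    """Hypothetical stand-in: count each item of data into histogram (in place)."""
    for item in data:
        # Existing keys are incremented; unseen items start a new count at 1.
        histogram[item] = histogram.get(item, 0) + 1


counts = {"A": 2}
increment_histogram_sketch(["A", "B", "B"], counts)
```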
- pdm_utils.functions.basic.invert_dictionary(dictionary)¶
Inverts a dictionary, where the values and keys are swapped.
- Parameters
dictionary (dict) – A dictionary to be inverted.
- Returns
Returns an inverted dictionary of the given dictionary.
- Return type
dict
- pdm_utils.functions.basic.is_float(string)¶
Check if string can be converted to float.
- pdm_utils.functions.basic.join_strings(input_list, delimiter=' ')¶
Join a list of values into a single string, separated by a delimiter.
- Parameters
input_list (list) – List of values to join.
delimiter (str) – Delimiter used between values.
- Returns
Concatenated values, excluding all None and ‘’ values.
- Return type
str
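The exclusion of None and ‘’ values described under Returns can be sketched as below. The name `join_strings_sketch` is hypothetical, and the assumption is that all remaining values are already strings.

```python
def join_strings_sketch(input_list, delimiter=" "):
    """Hypothetical stand-in: join values, skipping None and empty strings."""
    kept = [value for value in input_list if value is not None and value != ""]
    return delimiter.join(kept)


joined = join_strings_sketch(["Mycobacterium", None, "phage", "", "Trixie"])
```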
- pdm_utils.functions.basic.lower_case(value)¶
Return the value lowercased if it is within a specific set of values.
- Parameters
value (str) – The value to be checked.
- Returns
The lowercased value if it is equivalent to ‘none’, ‘retrieve’, or ‘retain’.
- Return type
str
- pdm_utils.functions.basic.make_new_dir(output_dir, new_dir, attempt=1, mkdir=True)¶
Make a new directory.
Checks to verify the new directory name is valid and does not already exist. If it already exists, it attempts to extend the name with an integer suffix.
- Parameters
output_dir (Path) – Full path to the directory where the new directory will be created.
new_dir (Path) – Name of the new directory to be created.
attempt (int) – Number of attempts to create the directory.
- Returns
If successful, the full path of the created directory. If unsuccessful, None.
- Return type
Path, None
- pdm_utils.functions.basic.make_new_file(output_dir, new_file, ext, attempt=1)¶
Make a new file.
Checks to verify the new file name is valid and does not already exist. If it already exists, it attempts to extend the name with an integer suffix.
- Parameters
output_dir (Path) – Full path to the directory where the new directory will be created.
new_file (Path) – Name of the new file to be created.
ext (str) – Name of the file extension to be used.
attempt (int) – Number of attempts to create the file.
- Returns
If successful, the full path of the created file. If unsuccessful, None.
- Return type
Path, None
- pdm_utils.functions.basic.match_items(list1, list2)¶
Match values of two lists and return several results.
- Parameters
list1 (list) – The first input list.
list2 (list) – The second input list.
- Returns
tuple (matched_unique_items, set1_unmatched_unique_items, set2_unmatched_unique_items, set1_duplicate_items, set2_duplicate_items) WHERE matched_unique_items(set) is the set of matched unique values. set1_unmatched_unique_items(set) is the set of unmatched unique values from the first list. set2_unmatched_unique_items(set) is the set of unmatched unique values from the second list. set1_duplicate_items(set) is the set of duplicate values from the first list. set2_duplicate_items(set) is the set of duplicate values from the second list.
- Return type
tuple
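The five-element result can be pictured with the sketch below. It is an illustration under assumptions rather than the pdm_utils code: the names `match_items_sketch` and `_split_unique` are hypothetical, and duplicates are separated within each list before the unique values are matched.

```python
from collections import Counter


def _split_unique(items):
    """Hypothetical helper: split a list into (unique_set, duplicate_set)."""
    counts = Counter(items)
    unique = {item for item, n in counts.items() if n == 1}
    duplicates = {item for item, n in counts.items() if n > 1}
    return unique, duplicates


def match_items_sketch(list1, list2):
    """Hypothetical stand-in: match the unique values of two lists."""
    unique1, duplicates1 = _split_unique(list1)
    unique2, duplicates2 = _split_unique(list2)
    matched = unique1 & unique2
    return (matched, unique1 - unique2, unique2 - unique1, duplicates1, duplicates2)


matched, unmatched1, unmatched2, dups1, dups2 = match_items_sketch(
    ["a", "b", "b", "c"], ["a", "d", "d", "e"]
)
```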
- pdm_utils.functions.basic.merge_set_dicts(dict1, dict2)¶
Merge two dictionaries of sets.
- Parameters
dict1 (dict) – First dictionary of sets.
dict2 (dict) – Second dictionary of sets.
- Returns
Merged dictionary containing all keys from both dictionaries, and for each shared key the value is a set of merged values.
- Return type
dict
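A minimal sketch of the merge, assuming the input dictionaries should be left unmodified (the documentation does not say); `merge_set_dicts_sketch` is a hypothetical name.

```python
def merge_set_dicts_sketch(dict1, dict2):
    """Hypothetical stand-in: union the sets stored under each key."""
    merged = {}
    for key in dict1.keys() | dict2.keys():
        # Keys present in only one dictionary contribute an empty set
        # from the other side.
        merged[key] = set(dict1.get(key, set())) | set(dict2.get(key, set()))
    return merged


merged = merge_set_dicts_sketch({"A": {1, 2}, "B": {3}}, {"A": {2, 4}, "C": {5}})
```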
- pdm_utils.functions.basic.parse_flag_file(flag_file)¶
Parse a file to an evaluation flag dictionary.
- Parameters
flag_file (str) – A two-column csv-formatted file WHERE column 1 is the evaluation flag and column 2 is ‘True’ or ‘False’.
- Returns
A dictionary WHERE keys (str) are evaluation flags and values (bool) indicate the flag setting. Only flags that contain boolean values are returned.
- Return type
dict
- pdm_utils.functions.basic.parse_names_from_record_field(description)¶
Attempts to parse the phage/plasmid/prophage name and host genus from a given string.
- Parameters
description (str) – The input string to be parsed.
- Returns
name, host_genus
- pdm_utils.functions.basic.partition_list(data_list, size)¶
Chunks list into a list of lists with the given size.
- Parameters
data_list (list) – List to be split into equal-sized lists.
size (int) – Length of the resulting list chunks.
- Returns
Returns list of lists with length of the given size.
- Return type
list[list]
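Chunking by slice is one straightforward way to get this result. In the sketch below (hypothetical name `partition_list_sketch`), the final chunk is allowed to be shorter when the list length is not a multiple of size; that is an assumption, as the description does not cover that case.

```python
def partition_list_sketch(data_list, size):
    """Hypothetical stand-in: split data_list into consecutive chunks of size."""
    return [data_list[i:i + size] for i in range(0, len(data_list), size)]


chunks = partition_list_sketch([1, 2, 3, 4, 5], 2)
```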
- pdm_utils.functions.basic.prepare_filepath(folder_path, file_name, folder_name=None)¶
Prepare path to new file.
- Parameters
folder_path (Path) – Path to the directory to contain the file.
file_name (str) – Name of the file.
folder_name (Path) – Name of sub-directory to create.
- Returns
Path to file in directory.
- Return type
Path
- pdm_utils.functions.basic.reformat_coordinates(start, stop, current, new)¶
Converts common coordinate formats.
The type of coordinate formats include:
‘0_half_open’:
0-based half-open intervals, the common format for BAM files and the UCSC Browser database. This format seems to be more efficient when performing genomics computations.
‘1_closed’:
1-based closed intervals, the common format for the MySQL Database, UCSC Browser, the Ensembl genomics database, VCF files, and GFF files. This format seems to be more intuitive and is used for visualization.
The function assumes coordinates reflect the start and stop boundaries (where the start coordinate is smaller than the stop coordinate), instead of transcription start and stop coordinates.
- Parameters
start (int) – Start coordinate
stop (int) – Stop coordinate
current (str) – Indicates the indexing format of the input coordinates.
new (str) – Indicates the indexing format of the output coordinates.
- Returns
The re-formatted start and stop coordinates.
- Return type
int
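The arithmetic behind the two formats is small: the same interval has a start that differs by one and an identical stop. For example, the first ten bases are 0 to 10 in ‘0_half_open’ and 1 to 10 in ‘1_closed’. The sketch below illustrates this; the name `reformat_coordinates_sketch`, returning the pair as a tuple, and raising ValueError for unrecognized formats are all assumptions made for the example.

```python
def reformat_coordinates_sketch(start, stop, current, new):
    """Hypothetical stand-in: convert between '0_half_open' and '1_closed'."""
    valid = {"0_half_open", "1_closed"}
    if current not in valid or new not in valid:
        # Assumption: the library's handling of unknown formats is not documented.
        raise ValueError("unrecognized coordinate format")
    if current == new:
        return (start, stop)
    if current == "0_half_open":
        # 0-based half-open -> 1-based closed: only the start shifts.
        return (start + 1, stop)
    # 1-based closed -> 0-based half-open.
    return (start - 1, stop)


closed = reformat_coordinates_sketch(0, 10, "0_half_open", "1_closed")
half_open = reformat_coordinates_sketch(1, 10, "1_closed", "0_half_open")
```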
- pdm_utils.functions.basic.reformat_description(raw_description)¶
Reformat a gene description.
- Parameters
raw_description (str) – Input value to be reformatted.
- Returns
tuple (description, processed_description) WHERE description(str) is the original value stripped of leading and trailing whitespace. processed_description(str) is the reformatted value, in which non-informative/generic data is removed.
- Return type
tuple
- pdm_utils.functions.basic.reformat_strand(input_value, format, case=False)¶
Converts common strand orientation formats.
- Parameters
input_value (str, int) – Value that will be edited.
format (str) – Indicates how the value should be edited. Valid format types include:
‘fr_long’ (‘forward’, ‘reverse’)
‘fr_short’ (‘f’, ‘r’)
‘fr_abbrev1’ (‘for’, ‘rev’)
‘fr_abbrev2’ (‘fwd’, ‘rev’)
‘tb_long’ (‘top’, ‘bottom’)
‘tb_short’ (‘t’, ‘b’)
‘wc_long’ (‘watson’, ‘crick’)
‘wc_short’ (‘w’, ‘c’)
‘operator’ (‘+’, ‘-’)
‘numeric’ (1, -1)
case (bool) – Indicates whether the output value should be capitalized.
- Returns
The re-formatted value as indicated by ‘format’.
- Return type
str, int
- pdm_utils.functions.basic.select_option(prompt, valid_response_set)¶
Select an option from a set of options.
- Parameters
prompt (str) – Message to display before displaying option.
valid_response_set (set) – Set of valid options to choose.
- Returns
option
- Return type
str, int
- pdm_utils.functions.basic.set_path(path, kind=None, expect=True)¶
Confirm validity of path argument.
- Parameters
path (Path) – path
kind (str) – (“file”, “dir”), corresponding with paths to be checked as either files or directories.
expect (bool) – Indicates if the path is expected to be of the indicated kind.
- Returns
Absolute path if valid, otherwise sys.exit is called.
- Return type
Path
- pdm_utils.functions.basic.sort_histogram(histogram, descending=True)¶
Sorts a dictionary by its values and returns the sorted histogram.
- Parameters
histogram (dict) – Dictionary containing keys whose values contain counts.
- Returns
An ordered dict from items from the histogram sorted by value.
- Return type
OrderedDict
- pdm_utils.functions.basic.sort_histogram_keys(histogram, descending=True)¶
Sorts a dictionary by its values and returns the sorted keys.
- Parameters
histogram (dict) – Dictionary containing keys whose values contain counts.
- Returns
A list from keys from the histogram sorted by value.
- Return type
list
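Both histogram sorters can be expressed with `sorted` on the dictionary items. The sketch below uses hypothetical `_sketch` names; how the library orders keys with equal counts is not documented, so the example avoids ties.

```python
from collections import OrderedDict


def sort_histogram_sketch(histogram, descending=True):
    """Hypothetical stand-in: OrderedDict of histogram items sorted by count."""
    ordered_items = sorted(
        histogram.items(), key=lambda item: item[1], reverse=descending
    )
    return OrderedDict(ordered_items)


def sort_histogram_keys_sketch(histogram, descending=True):
    """Hypothetical stand-in: only the keys, in the same sorted order."""
    return list(sort_histogram_sketch(histogram, descending))


counts = {"A": 3, "B": 7, "C": 1}
sorted_counts = sort_histogram_sketch(counts)
ascending_keys = sort_histogram_keys_sketch(counts, descending=False)
```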
- pdm_utils.functions.basic.split_string(string)¶
Split a string based on alphanumeric characters.
Iterates through a string, identifies the first position in which the character is numeric, and creates two strings at this position.
- Parameters
string (str) – The value to be split.
- Returns
tuple (left, right) WHERE left(str) is the left portion of the input value prior to the first numeric character and only contains alphabetic characters (or will be ‘’). right(str) is the right portion of the input value after the first numeric character and only contains numeric characters (or will be ‘’).
- Return type
tuple
- pdm_utils.functions.basic.trim_characters(string)¶
Remove leading and trailing generic characters from a string.
- Parameters
string (str) – Value that will be trimmed. Characters that will be removed include: ‘.’, ‘,’, ‘;’, ‘-’, ‘_’.
- Returns
Edited value.
- Return type
str
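For the listed characters, Python's `str.strip` with an explicit character set gives the described effect. This is only a sketch (hypothetical name `trim_characters_sketch`); whether the library also trims whitespace is not stated above and is not assumed here.

```python
def trim_characters_sketch(string):
    """Hypothetical stand-in: strip leading/trailing '.', ',', ';', '-', '_'."""
    return string.strip(".,;-_")


trimmed = trim_characters_sketch("_-gp12;.")
```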
- pdm_utils.functions.basic.truncate_value(value, length, suffix)¶
Truncate a string.
- Parameters
value (str) – String that should be truncated.
length (int) – Final length of truncated string.
suffix (str) – String that should be appended to truncated string.
- Returns
the truncated string
- Return type
str
- pdm_utils.functions.basic.verify_path(filepath, kind=None)¶
Verifies that a given path exists.
- Parameters
filepath (str) – full path to the desired file/directory.
kind (str) – (“file”, “dir”), corresponding with paths to be checked as either files or directories.
- Returns
True if path is verified, False otherwise.
- Return type
bool
- pdm_utils.functions.basic.verify_path2(path, kind=None, expect=True)¶
Verifies that a given path exists.
- Parameters
path (Path) – path
kind (str) – (“file”, “dir”), corresponding with paths to be checked as either files or directories.
expect (bool) – Indicates if the path is expected to be of the indicated kind.
- Returns
tuple (result, message) WHERE result(bool) indicates if the expectation was satisfied. message(str) is a description of the result.
- Return type
tuple
pdm_utils.functions.cartography module¶
- pdm_utils.functions.cartography.get_map(mapper, table)¶
Get SQLAlchemy ORM map object.
- Parameters
mapper (DeclarativeMeta) – Connected and prepared SQLAlchemy automap base object.
table (str) – Case-insensitive table to retrieve an ORM map for.
- Returns
SQLAlchemy mapped object.
- Return type
DeclarativeMeta
- pdm_utils.functions.cartography.map_cds(metadata)¶
- pdm_utils.functions.cartography.map_genome(metadata)¶
pdm_utils.functions.configfile module¶
Configuration file definition and parsing.
- pdm_utils.functions.configfile.build_complete_config(file)¶
Build a complete config object by merging user-supplied and default config.
- pdm_utils.functions.configfile.create_empty_config_file(dir, file, null_value)¶
Create an empty config file with all available settings.
- pdm_utils.functions.configfile.default_parser(null_value)¶
Constructs complete config with empty values.
- pdm_utils.functions.configfile.default_sections_keys()¶
- pdm_utils.functions.configfile.parse_config(file, parser=None)¶
Get parameters from config file.
- pdm_utils.functions.configfile.setup_section(keys, value)¶
- pdm_utils.functions.configfile.write_config(parser, filepath)¶
Write a ConfigParser to file.
pdm_utils.functions.eval_modes module¶
Evaluation mode functions and dictionaries.
- pdm_utils.functions.eval_modes.get_eval_flag_dict(eval_mode)¶
Get a dictionary of evaluation flags.
- Parameters
eval_mode (str) – Valid evaluation mode (base, draft, final, auto, misc, custom)
- Returns
Dictionary of boolean values.
- Return type
dict
pdm_utils.functions.fileio module¶
- pdm_utils.functions.fileio.export_data_dict(data_dicts, file_path, headers, include_headers=False)¶
Save a dictionary of data to file using specified column headers.
Ensures the output file contains a specified number of columns, and it ensures the column headers are exported as well.
- Parameters
data_dicts (list) – list of elements, where each element is a dictionary.
file_path (Path) – Path to file to export data.
headers (list) – List of strings to define the column order in the file. If include_headers is selected, the first row of the file will contain each string.
include_headers (bool) – Indicates whether the file should contain a row of column names derived from the headers parameter.
- pdm_utils.functions.fileio.parse_feature_table(filehandle)¶
Takes a (five-column) feature table(s) file handle and parses the data.
- Parameters
filehandle – Handle for a five-column formatted feature table file.
- Returns
Returns a feature table file parser generator.
- Return type
FeatureTableFileParser
- pdm_utils.functions.fileio.read_feature_table(filehandle)¶
Reads a (five-column) feature table and parses the data into a seqrecord.
- Parameters
filepath (Path) – Path to the five-column formatted feature table file.
- Returns
Returns a Biopython SeqRecord object with the table data.
- Return type
SeqRecord
- pdm_utils.functions.fileio.reintroduce_fasta_duplicates(ts_to_gs, filepath)¶
Reads a fasta file and reintroduces (rewrites) duplicate sequences guided by an ungapped translation to sequence-id map.
- Parameters
filepath (pathlib.Path) – Path to fasta-formatted multiple sequence file
ts_to_gs (dict) – Dictionary mapping unique translations to sequence-ids
- pdm_utils.functions.fileio.retrieve_data_dict(filepath)¶
Open file and retrieve a dictionary of data.
- Parameters
filepath (Path) – Path to file containing data and column names.
- Returns
A list of elements, where each element is a dictionary representing one row of data. Each key is a column name and each value is the data stored in that field.
- Return type
list
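The list-of-row-dictionaries shape used by export_data_dict and retrieve_data_dict matches what Python's `csv.DictWriter` and `csv.DictReader` produce. The sketch below shows a round trip under the assumption of a comma-delimited file with a header row; the delimiter, the `_sketch` function names, and the sample column names are hypothetical and not taken from pdm_utils.

```python
import csv
import tempfile
from pathlib import Path


def export_rows_sketch(data_dicts, file_path, headers, include_headers=False):
    """Hypothetical stand-in: write row dicts using headers as the column order."""
    with open(file_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=headers, extrasaction="ignore")
        if include_headers:
            writer.writeheader()
        writer.writerows(data_dicts)


def retrieve_rows_sketch(file_path):
    """Hypothetical stand-in: read rows back as one dict per row."""
    with open(file_path, newline="") as handle:
        return [dict(row) for row in csv.DictReader(handle)]


with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "phages.csv"
    export_rows_sketch(
        [{"PhageID": "Trixie", "Cluster": "A"}],
        path,
        ["PhageID", "Cluster"],
        include_headers=True,
    )
    rows = retrieve_rows_sketch(path)
```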
- pdm_utils.functions.fileio.write_database(alchemist, version, export_path, db_name=None)¶
Output .sql file from the selected database.
- Parameters
alchemist (AlchemyHandler) – A connected and fully built AlchemyHandler object.
version (int) – Database version information.
export_path (Path) – Path to a valid dir for file creation.
- pdm_utils.functions.fileio.write_fasta(ids_seqs, infile_path, name=None)¶
Writes the input genes to the indicated file in FASTA multiple sequence format (unaligned).
- Parameters
ids_seqs (dict) – The ids and sequences to be written to file.
infile_path (Path, str) – The path of the file to write the genes to.
- pdm_utils.functions.fileio.write_feature_table(seqrecord_list, export_path, verbose=False)¶
Outputs files as five_column tab-delimited text files.
- Parameters
seqrecord_list (list[SeqRecord]) – List of populated SeqRecords.
export_path (Path) – Path to a dir for file creation.
verbose (bool) – A boolean value to toggle progress print statements.
- pdm_utils.functions.fileio.write_seqrecord(seqrecord, file_path, file_format)¶
- pdm_utils.functions.fileio.write_seqrecords(seqrecord_list, file_format, export_path, export_name=None, concatenate=False, threads=1, verbose=False)¶
Outputs files with a particular format from a SeqRecord list.
- Parameters
seqrecord_list (list[SeqRecord]) – List of populated SeqRecords.
file_format (str) – Biopython supported file type.
export_path (Path) – Path to a dir for file creation.
concatenate (bool) – A boolean to toggle concatenation of SeqRecords.
verbose (bool) – A boolean value to toggle progress print statements.
pdm_utils.functions.flat_files module¶
Functions to interact with, use, and parse genomic data from GenBank-formatted flat files.
- pdm_utils.functions.flat_files.cds_to_seqrecord(cds, parent_genome, gene_domains=[], desc_type='gb')¶
Creates a SeqRecord object from a Cds and its parent Genome.
- Parameters
cds (Cds) – A populated Cds object.
parent_genome (Genome) – Populated parent Genome object of the Cds object.
gene_domains (list) – List of domain objects populated with column attributes.
desc_type (str) – Intended format of the CDS SeqRecord description.
- Returns
Filled Biopython SeqRecord object.
- Return type
SeqRecord
- pdm_utils.functions.flat_files.create_fasta_seqrecord(header, sequence_string)¶
Create a fasta-formatted Biopython SeqRecord object.
- Parameters
header (str) – Description of the sequence.
sequence_string (str) – Nucleotide sequence.
- Returns
Biopython SeqRecord containing the nucleotide sequence.
- Return type
SeqRecord
- pdm_utils.functions.flat_files.create_seqfeature_dictionary(seqfeature_list)¶
Create a dictionary of Biopython SeqFeature objects based on their type.
From a list of all Biopython SeqFeatures derived from a GenBank-formatted flat file, create a dictionary of SeqFeatures based on their ‘type’ attribute.
- Parameters
seqfeature_list (list) – List of Biopython SeqFeatures
genome_id (str) – An identifier for the genome in which the seqfeature is defined.
- Returns
A dictionary of Biopython SeqFeatures: Key: SeqFeature type (source, tRNA, CDS, other) Value: SeqFeature
- Return type
dict
- pdm_utils.functions.flat_files.format_cds_seqrecord_CDS_feature(cds_feature, cds, parent_genome)¶
- pdm_utils.functions.flat_files.genome_to_seqrecord(phage_genome)¶
Creates a SeqRecord object from a pdm_utils Genome object.
- Parameters
phage_genome (Genome) – A pdm_utils Genome object.
- Returns
A BioPython SeqRecord object
- Return type
SeqRecord
- pdm_utils.functions.flat_files.get_cds_seqrecord_annotations(cds, parent_genome)¶
Function that creates a Cds SeqRecord annotations attribute dict.
- Parameters
cds (Cds) – A populated Cds object.
parent_genome (Genome) – Populated parent Genome object of the Cds object.
- Returns
Formatted SeqRecord annotations dictionary.
- Return type
dict{str}
- pdm_utils.functions.flat_files.get_cds_seqrecord_annotations_comments(cds)¶
Function that creates a Cds SeqRecord comments attribute tuple.
- Parameters
cds (Cds) – A populated Cds object.
- pdm_utils.functions.flat_files.get_cds_seqrecord_regions(gene_domains, cds)¶
- pdm_utils.functions.flat_files.get_genome_seqrecord_annotations(phage_genome)¶
Helper function that uses Genome data to populate the annotations SeqRecord attribute
- Parameters
phage_genome (genome) – Input a Genome object.
- Returns
annotations(dictionary) is a dictionary with the formatting of BioPython’s SeqRecord annotations attribute
- pdm_utils.functions.flat_files.get_genome_seqrecord_annotations_comments(phage_genome)¶
Helper function that uses Genome data to populate the comment annotation attribute
- Parameters
phage_genome (genome) – Input a Genome object.
- Returns
cluster_comment, auto_generated_comment, annotation_status_comment, qc_and_retrieval values (tuple) is a tuple with the formatting of BioPython’s SeqRecord annotations comment attribute
- pdm_utils.functions.flat_files.get_genome_seqrecord_description(phage_genome)¶
Helper function to construct a description SeqRecord attribute.
- Parameters
phage_genome (genome) – Input a Genome object.
- Returns
description is a formatted string parsed from genome data
- pdm_utils.functions.flat_files.get_genome_seqrecord_features(phage_genome)¶
Helper function that uses Genome data to populate the features SeqRecord attribute
- Parameters
phage_genome (genome) – Input a Genome object.
- Returns
features is a list of SeqFeature objects parsed from cds objects
- pdm_utils.functions.flat_files.parse_cds_seqfeature(seqfeature)¶
Parse data from a Biopython CDS SeqFeature object into a Cds object.
- Parameters
seqfeature (SeqFeature) – Biopython SeqFeature
genome_id (str) – An identifier for the genome in which the seqfeature is defined.
- Returns
A pdm_utils Cds object
- Return type
Cds
- pdm_utils.functions.flat_files.parse_coordinates(seqfeature)¶
Parse the boundary coordinates from a GenBank-formatted flat file.
The function takes a Biopython SeqFeature object containing data that was parsed from the feature in the flat file. Parsing these coordinates can be tricky. There can be more than one set of coordinates if it is a compound location. Only features with 1 or 2 open reading frames (parts) are correctly parsed. Also, the boundaries may not be precise; instead they may be open or fuzzy. Non-precise coordinates are converted to ‘-1’. If the strand is undefined, the coordinates are converted to ‘-1’ and parts is set to ‘0’. If an incorrect data type is provided, coordinates are set to ‘-1’ and parts is set to ‘0’.
- Parameters
seqfeature (SeqFeature) – Biopython SeqFeature
- Returns
tuple (start, stop, parts) WHERE start(int) is the first coordinate, regardless of strand. stop(int) is the second coordinate, regardless of strand. parts(int) is the number of open reading frames that define the feature.
- pdm_utils.functions.flat_files.parse_genome_data(seqrecord, filepath=PosixPath('.'), translation_table=11, genome_id_field='_organism_name', gnm_type='', host_genus_field='_organism_host_genus')¶
Parse data from a Biopython SeqRecord object into a Genome object.
All Source, CDS, tRNA, and tmRNA features are parsed into their associated Source, Cds, Trna, and Tmrna objects.
- Parameters
seqrecord (SeqRecord) – A Biopython SeqRecord object.
filepath (Path) – A filename associated with the returned Genome object.
translation_table (int) – The applicable translation table for the genome’s CDS features.
genome_id_field (str) – The SeqRecord attribute from which the unique genome identifier/name is stored.
host_genus_field (str) – The SeqRecord attribute from which the unique host genus identifier/name is stored.
gnm_type (str) – Identifier for the type of genome.
- Returns
A pdm_utils Genome object.
- Return type
Genome
- pdm_utils.functions.flat_files.parse_source_seqfeature(seqfeature)¶
Parses a Biopython Source SeqFeature.
- Parameters
seqfeature (SeqFeature) – Biopython SeqFeature
genome_id (str) – An identifier for the genome in which the seqfeature is defined.
- Returns
A pdm_utils Source object
- Return type
Source
- pdm_utils.functions.flat_files.parse_tmrna_seqfeature(seqfeature)¶
Parses data from a Biopython tmRNA SeqFeature object into a Tmrna object.
- Parameters
seqfeature (SeqFeature) – Biopython SeqFeature
- Returns
A pdm_utils Tmrna object
- Return type
Tmrna
- pdm_utils.functions.flat_files.parse_trna_seqfeature(seqfeature)¶
Parse data from a Biopython tRNA SeqFeature object into a Trna object.
- Parameters
seqfeature (SeqFeature) – Biopython SeqFeature
- Returns
A pdm_utils Trna object
- Return type
Trna
- pdm_utils.functions.flat_files.retrieve_genome_data(filepath)¶
Retrieve data from a GenBank-formatted flat file.
- Parameters
filepath (Path) – Path to GenBank-formatted flat file that will be parsed using Biopython.
- Returns
If there is only one record, a Biopython SeqRecord of parsed data. If the file cannot be parsed, or if there are multiple records, None value is returned.
- Return type
SeqRecord
- pdm_utils.functions.flat_files.sort_seqrecord_features(seqrecord)¶
Function that sorts and processes the seqfeature objects of a seqrecord.
- Parameters
seqrecord (SeqRecord) – Phage genome Biopython seqrecord object
pdm_utils.functions.mysqldb module¶
Functions to interact with MySQL.
- pdm_utils.functions.mysqldb.change_version(engine, amount=1)¶
Change the database version number.
- Parameters
engine (Engine) – SQLAlchemy Engine object able to connect to a MySQL database.
amount (int) – Amount to increment/decrement version number.
- pdm_utils.functions.mysqldb.check_schema_compatibility(engine, pipeline, code_version=None)¶
Confirm database schema is compatible with code.
If schema version is not compatible, sys.exit is called.
- Parameters
engine (Engine) – SQLAlchemy Engine object able to connect to a MySQL database.
pipeline (str) – Description of the pipeline checking compatibility.
code_version (int) – Schema version on which the pipeline operates. If no schema version is provided, the package-wide schema version value is used.
- pdm_utils.functions.mysqldb.create_delete(table, field, data)¶
Create MySQL DELETE statement.
“DELETE FROM <table> WHERE <field> = ‘<data>’.”
- Parameters
table (str) – The database table from which information is deleted.
field (str) – The column upon which the statement is conditioned.
data (str) – The value of ‘field’ upon which the statement is conditioned.
- Returns
A MySQL DELETE statement.
- Return type
str
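The statement template above can be sketched with plain string formatting. This is a minimal illustration, not the package's implementation: the helper name sketch_create_delete and the trailing semicolon are assumptions, and values are not escaped, so only trusted input is assumed.

```python
def sketch_create_delete(table, field, data):
    """Build a DELETE statement conditioned on a single column value.

    Hypothetical helper for illustration; no escaping is performed.
    """
    return f"DELETE FROM {table} WHERE {field} = '{data}';"


statement = sketch_create_delete("phage", "PhageID", "Trixie")
print(statement)
```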
- pdm_utils.functions.mysqldb.create_gene_table_insert(cds_ftr)¶
Create a MySQL gene table INSERT statement.
- Parameters
cds_ftr (Cds) – A pdm_utils Cds object.
- Returns
A MySQL statement to INSERT a new row in the ‘gene’ table with data for several fields.
- Return type
str
- pdm_utils.functions.mysqldb.create_genome_statements(gnm, tkt_type='')¶
Create list of MySQL statements based on the ticket type.
- Parameters
gnm (Genome) – A pdm_utils Genome object.
tkt_type (str) – ‘add’ or ‘replace’.
- Returns
List of MySQL statements to INSERT all data from a genome into the database (DELETE FROM genome, INSERT INTO phage, INSERT INTO gene, …).
- Return type
list
- pdm_utils.functions.mysqldb.create_phage_table_insert(gnm)¶
Create a MySQL phage table INSERT statement.
- Parameters
gnm (Genome) – A pdm_utils Genome object.
- Returns
A MySQL statement to INSERT a new row in the ‘phage’ table with data for several fields.
- Return type
str
- pdm_utils.functions.mysqldb.create_seq_set(engine)¶
Create set of genome sequences currently in a MySQL database.
- Parameters
engine (Engine) – SQLAlchemy Engine object able to connect to a MySQL database.
- Returns
A set of unique values from phage.Sequence.
- Return type
set
- pdm_utils.functions.mysqldb.create_tmrna_table_insert(tmrna_ftr)¶
- Parameters
tmrna_ftr –
- Returns
- pdm_utils.functions.mysqldb.create_trna_table_insert(trna_ftr)¶
Create a MySQL trna table INSERT statement.
- Parameters
trna_ftr (Trna) – A pdm_utils Trna object.
- Returns
A MySQL statement to INSERT a new row in the ‘trna’ table with all of trna_ftr’s relevant data.
- Return type
str
- pdm_utils.functions.mysqldb.create_update(table, field2, value2, field1, value1)¶
Create MySQL UPDATE statement.
“UPDATE <table> SET <field2> = ‘<value2>’ WHERE <field1> = ‘<value1>’.”
When the new value to be added is ‘singleton’ (e.g. for Cluster fields), or an empty value (e.g. None, “none”, etc.), the new value is set to NULL.
- Parameters
table (str) – The database table to update.
field1 (str) – The column upon which the statement is conditioned.
value1 (str) – The value of ‘field1’ upon which the statement is conditioned.
field2 (str) – The column that will be updated.
value2 (str) – The value that will be inserted into ‘field2’.
- Returns
A MySQL UPDATE statement.
- Return type
str
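The NULL substitution described above can be sketched as follows. The helper name sketch_create_update and the exact set of values treated as empty are assumptions made for illustration; the package may recognize a different set.

```python
def sketch_create_update(table, field2, value2, field1, value1):
    """Build an UPDATE statement, writing NULL for 'singleton' or empty values.

    Hypothetical helper; the empty_values set below is an assumption.
    """
    empty_values = {"", "none", "null", "singleton"}
    if value2 is None or str(value2).strip().lower() in empty_values:
        new_value = "NULL"  # unquoted, so MySQL stores a real NULL
    else:
        new_value = f"'{value2}'"
    return f"UPDATE {table} SET {field2} = {new_value} WHERE {field1} = '{value1}';"


print(sketch_create_update("phage", "Cluster", "A", "PhageID", "Trixie"))
print(sketch_create_update("phage", "Cluster", "Singleton", "PhageID", "Trixie"))
```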
- pdm_utils.functions.mysqldb.execute_transaction(engine, statement_list=[])¶
Execute list of MySQL statements within a single defined transaction.
- Parameters
engine (Engine) – SQLAlchemy Engine object able to connect to a MySQL database.
statement_list – a list of any number of MySQL statements with no expectation that anything will return
- Returns
tuple (result, message) WHERE result(int) is a status code: 0 means no problems, 1 means problems. message(str) is a description of the result.
- Return type
tuple
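The all-or-nothing behavior and the (result, message) return convention can be illustrated with the standard library. This sketch swaps in sqlite3 for SQLAlchemy and MySQL so that it runs without a server; the helper name and message wording are assumptions.

```python
import sqlite3


def sketch_execute_transaction(connection, statement_list=None):
    """Run all statements in one transaction and return (result, message).

    Hypothetical helper using sqlite3 in place of a MySQL engine.
    """
    statements = statement_list or []
    try:
        # As a context manager, the connection commits on success
        # and rolls back if any statement raises.
        with connection:
            for statement in statements:
                connection.execute(statement)
    except sqlite3.Error as error:
        return 1, f"Transaction rolled back: {error}"
    return 0, f"Executed {len(statements)} statement(s)."


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE phage (PhageID TEXT PRIMARY KEY)")
connection.commit()

ok = sketch_execute_transaction(connection, ["INSERT INTO phage VALUES ('Trixie')"])
bad = sketch_execute_transaction(connection, [
    "INSERT INTO phage VALUES ('L5')",
    "INSERT INTO phage VALUES ('Trixie')",  # duplicate key, so the whole batch fails
])
count = connection.execute("SELECT COUNT(*) FROM phage").fetchone()[0]
```

After the failed batch, the table still holds only the row from the first call, because the insert of 'L5' was rolled back together with the duplicate.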
- pdm_utils.functions.mysqldb.get_schema_version(engine)¶
Identify the schema version of the database_versions_list.
Schema version data has not been persisted in every schema version, so if schema version data is not found, it is deduced from other parts of the schema.
- Parameters
engine (Engine) – SQLAlchemy Engine object able to connect to a MySQL database.
- Returns
The version of the pdm_utils database schema.
- Return type
int
- pdm_utils.functions.mysqldb.parse_feature_data(engine, ftr_type, column=None, phage_id_list=None, query=None)¶
Returns Cds objects containing data parsed from a MySQL database.
- Parameters
engine (Engine) – This parameter is passed directly to the ‘retrieve_data’ function.
query (str) – This parameter is passed directly to the ‘retrieve_data’ function.
ftr_type (str) – Indicates the type of features retrieved.
column (str) – This parameter is passed directly to the ‘retrieve_data’ function.
phage_id_list (list) – This parameter is passed directly to the ‘retrieve_data’ function.
- Returns
A list of pdm_utils Cds objects.
- Return type
list
- pdm_utils.functions.mysqldb.parse_gene_table_data(data_dict, trans_table=11)¶
Parse a MySQL database dictionary to create a Cds object.
- Parameters
data_dict (dict) – Dictionary of data retrieved from the gene table.
trans_table (int) – The translation table that can be used to translate CDS features.
- Returns
A pdm_utils Cds object.
- Return type
Cds
- pdm_utils.functions.mysqldb.parse_genome_data(engine, phage_id_list=None, phage_query=None, gene_query=None, trna_query=None, tmrna_query=None, gnm_type='')¶
Returns a list of Genome objects containing data parsed from a MySQL database.
- Parameters
engine (Engine) – This parameter is passed directly to the ‘retrieve_data’ function.
phage_query (str) – This parameter is passed directly to the ‘retrieve_data’ function to retrieve data from the phage table.
gene_query (str) – This parameter is passed directly to the ‘parse_feature_data’ function to retrieve data from the gene table. If not None, pdm_utils Cds objects for all of the phage’s CDS features in the gene table will be constructed and added to the Genome object.
trna_query (str) – This parameter is passed directly to the ‘parse_feature_data’ function to retrieve data from the trna table. If not None, pdm_utils Trna objects for all of the phage’s tRNA features in the trna table will be constructed and added to the Genome object.
tmrna_query (str) – This parameter is passed directly to the ‘parse_feature_data’ function to retrieve data from the tmrna table. If not None, pdm_utils Tmrna objects for all of the phage’s tmRNA features in the tmrna table will be constructed and added to the Genome object.
phage_id_list (list) – This parameter is passed directly to the ‘retrieve_data’ function. If there is at least one valid PhageID, a pdm_utils genome object will be constructed only for that phage. If None, or an empty list, genome objects for all phages in the database will be constructed.
gnm_type (str) – Identifier for the type of genome.
- Returns
A list of pdm_utils Genome objects.
- Return type
list
- pdm_utils.functions.mysqldb.parse_phage_table_data(data_dict, trans_table=11, gnm_type='')¶
Parse a MySQL database dictionary to create a Genome object.
- Parameters
data_dict (dict) – Dictionary of data retrieved from the phage table.
trans_table (int) – The translation table that can be used to translate CDS features.
gnm_type (str) – Identifier for the type of genome.
- Returns
A pdm_utils Genome object.
- Return type
Genome
- pdm_utils.functions.mysqldb.parse_tmrna_table_data(data_dict)¶
Parse a MySQL database dictionary to create a Tmrna object.
- Parameters
data_dict (dict) – Dictionary of data retrieved from the tmrna table.
- Returns
A pdm_utils Tmrna object.
- Return type
Tmrna
pdm_utils.functions.mysqldb_basic module¶
Basic functions to interact with MySQL and manage databases.
- pdm_utils.functions.mysqldb_basic.convert_for_sql(value, check_set={}, single=True)¶
Convert a value for inserting into MySQL.
- Parameters
value (misc) – Value that should be checked for conversion.
check_set (set) – Set of values to check against.
single (bool) – Indicates whether single quotes should be used.
- Returns
Returns either “NULL” or the value encapsulated in quotes (“‘value’” or ‘“value”’)
- Return type
str
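The conversion rule can be sketched as below, assuming that membership in check_set is what triggers “NULL”. The helper name is hypothetical, and the real function may apply additional checks.

```python
def sketch_convert_for_sql(value, check_set=frozenset(), single=True):
    """Return "NULL" for values in check_set, otherwise the quoted value.

    Hypothetical helper; no escaping of embedded quotes is performed.
    """
    if value in check_set:
        return "NULL"
    quote = "'" if single else '"'
    return f"{quote}{value}{quote}"


print(sketch_convert_for_sql("A1"))
print(sketch_convert_for_sql("A1", single=False))
print(sketch_convert_for_sql(None, check_set={None, ""}))
```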
- pdm_utils.functions.mysqldb_basic.copy_db(engine, new_database)¶
Copies a database.
- Parameters
engine (Engine) – SQLAlchemy Engine object able to connect to a MySQL database, which contains the name of the database that will be copied into the new database.
new_database (str) – Name of the new copied database.
- Returns
Indicates if copy was successful (0) or failed (1).
- Return type
int
- pdm_utils.functions.mysqldb_basic.create_db(engine, database)¶
Create a new, empty database.
- Parameters
engine (Engine) – SQLAlchemy Engine object able to connect to a MySQL database.
database (str) – Name of the database to create.
- Returns
Indicates if create was successful (0) or failed (1).
- Return type
int
- pdm_utils.functions.mysqldb_basic.db_exists(engine, database)¶
Check if given name for a local MySQL database exists.
- Parameters
engine (Engine) – SQLAlchemy Engine object able to connect to a MySQL database.
database (str) – The name of the database to check exists.
- Returns
Returns whether the database exists
- Return type
bool
- pdm_utils.functions.mysqldb_basic.drop_create_db(engine, database)¶
Drops the database, then creates a new, empty database with the same name.
- Parameters
engine (Engine) – SQLAlchemy Engine object able to connect to a MySQL database.
database (str) – Name of the database to drop and create.
- Returns
Indicates if drop/create was successful (0) or failed (1).
- Return type
int
- pdm_utils.functions.mysqldb_basic.drop_db(engine, database)¶
Delete a database.
- Parameters
engine (Engine) – SQLAlchemy Engine object able to connect to a MySQL database.
database (str) – Name of the database to drop.
- Returns
Indicates if drop was successful (0) or failed (1).
- Return type
int
- pdm_utils.functions.mysqldb_basic.first(engine, executable, return_dict=True)¶
Execute a query and get the first row of data.
- Parameters
engine (Engine) – SQLAlchemy Engine object able to connect to a MySQL database.
executable (str) – Input an executable MySQL query.
return_dict (Boolean) – Toggle whether execute returns dict or tuple.
- Returns
Results from execution of given MySQL query.
- Return type
dict or tuple
- pdm_utils.functions.mysqldb_basic.get_columns(engine, database, table_name)¶
Retrieve column names from a table.
- Parameters
engine (Engine) – SQLAlchemy Engine object able to connect to a MySQL database.
database (str) – Name of the database to query.
table_name (str) – Name of the table to query.
- Returns
Set of column names.
- Return type
set
- pdm_utils.functions.mysqldb_basic.get_distinct(engine, table, column, null=None)¶
Get set of distinct values currently in a MySQL database.
- Parameters
engine (Engine) – SQLAlchemy Engine object able to connect to a MySQL database.
table (str) – A valid table in the database.
column (str) – A valid column in the table.
null (misc) – Replacement value for NULL data.
- Returns
A set of distinct values from the database.
- Return type
set
- pdm_utils.functions.mysqldb_basic.get_first_row_data(engine, table)¶
Retrieves data from the first row of a table.
- Parameters
engine (Engine) – SQLAlchemy Engine object able to connect to a MySQL database.
- Returns
Dictionary where key = column name.
- Return type
dict
- pdm_utils.functions.mysqldb_basic.get_mysql_dbs(engine)¶
Retrieve database names from MySQL.
- Parameters
engine (Engine) – SQLAlchemy Engine object able to connect to a MySQL database.
- Returns
Set of database names.
- Return type
set
- pdm_utils.functions.mysqldb_basic.get_table_count(engine, table)¶
Get the current number of genomes in the database.
- Parameters
engine (Engine) – SQLAlchemy Engine object able to connect to a MySQL database.
- Returns
Number of rows from the phage table.
- Return type
int
- pdm_utils.functions.mysqldb_basic.get_tables(engine, database)¶
Retrieve table names from the database.
- Parameters
engine (Engine) – SQLAlchemy Engine object able to connect to a MySQL database.
- Returns
Set of table names.
- Return type
set
- pdm_utils.functions.mysqldb_basic.install_db(engine, schema_filepath)¶
Install a MySQL file into the indicated database.
- Parameters
engine (Engine) – SQLAlchemy Engine object able to connect to a MySQL database.
schema_filepath (Path) – Path to the MySQL database file.
- Returns
Indicates if install was successful (0) or failed (1).
- Return type
int
- pdm_utils.functions.mysqldb_basic.mysql_login_command(username, password, database)¶
Construct list of strings representing a mysql command.
- pdm_utils.functions.mysqldb_basic.mysqldump_command(username, password, database)¶
Construct list of strings representing a mysqldump command.
- pdm_utils.functions.mysqldb_basic.pipe_commands(command1, command2)¶
Pipe one command into the other.
- pdm_utils.functions.mysqldb_basic.query_dict_list(engine, query)¶
Get the results of a MySQL query as a list of dictionaries.
- Parameters
engine (Engine) – SQLAlchemy Engine object able to connect to a MySQL database.
query (str) – MySQL query statement.
- Returns
List of dictionaries, where each dictionary represents a row of data.
- Return type
list
- pdm_utils.functions.mysqldb_basic.query_set(engine, query)¶
Retrieve set of data from MySQL query.
- Parameters
engine (Engine) – SQLAlchemy Engine object able to connect to a MySQL database.
query (str) – MySQL query statement.
- Returns
Set of queried data.
- Return type
set
- pdm_utils.functions.mysqldb_basic.retrieve_data(engine, column=None, query=None, id_list=None)¶
Retrieve genome data from a MySQL database for a single genome.
The query is modified to include one or more values.
- Parameters
engine (Engine) – SQLAlchemy Engine object able to connect to a MySQL database.
query (str) – A MySQL query that selects valid, specific columns from a valid table without conditioning on a specific column (e.g. ‘SELECT Column1, Column2 FROM table1’).
column (str) – A valid column in the table upon which the query can be conditioned.
id_list (list) – A list of valid values upon which the query can be conditioned. In conjunction with the ‘column’ parameter, the ‘query’ is modified (e.g. “WHERE Column1 IN (‘Value1’, ‘Value2’)”).
- Returns
A list of items, where each item is a dictionary of SQL data for each PhageID.
- Return type
list
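How ‘column’ and ‘id_list’ condition the base query can be sketched as a small string builder. The helper name sketch_condition_query is hypothetical, and the sketch only builds the statement; it does not execute it or escape values.

```python
def sketch_condition_query(query, column=None, id_list=None):
    """Append a WHERE ... IN (...) clause when a column and values are given.

    Hypothetical helper for illustration; values are not escaped.
    """
    if column and id_list:
        values = ", ".join(f"'{value}'" for value in id_list)
        return f"{query} WHERE {column} IN ({values})"
    # Without a column or values, the query is left unconditioned.
    return query


base = "SELECT PhageID, Name FROM phage"
print(sketch_condition_query(base, "PhageID", ["L5", "Trixie"]))
```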
- pdm_utils.functions.mysqldb_basic.scalar(engine, executable)¶
Execute a query and get the first field.
- Parameters
engine (Engine) – SQLAlchemy Engine object able to connect to a MySQL database.
executable (str) – Input an executable MySQL query.
- Returns
Scalar result from execution of given MySQL query.
- Return type
int
pdm_utils.functions.ncbi module¶
Misc. functions to interact with NCBI databases.
- pdm_utils.functions.ncbi.get_accessions_to_retrieve(summary_records)¶
Extract accessions from summary records.
- Parameters
summary_records (list) – List of dictionaries, where each dictionary is a record summary.
- Returns
List of accessions.
- Return type
list
- pdm_utils.functions.ncbi.get_data_handle(accession_list, db='nucleotide', rettype='gb', retmode='text')¶
- pdm_utils.functions.ncbi.get_records(accession_list, db='nucleotide', rettype='gb', retmode='text')¶
Retrieve records from NCBI from a list of active accessions.
Uses NCBI efetch implemented through BioPython Entrez.
- Parameters
accession_list (list) – List of NCBI accessions.
db (str) – Name of the database to get summaries from (e.g. ‘nucleotide’).
rettype (str) – Type of record to retrieve (e.g. ‘gb’).
retmode (str) – Format of data to retrieve (e.g. ‘text’).
- Returns
List of BioPython SeqRecords generated from GenBank records.
- Return type
list
- pdm_utils.functions.ncbi.get_summaries(db='', query_key='', webenv='')¶
Retrieve record summaries from NCBI for a list of accessions.
Uses NCBI esummary implemented through BioPython Entrez.
- Parameters
db (str) – Name of the database to get summaries from.
query_key (str) – Identifier for the search. This can be directly generated from run_esearch().
webenv (str) – Identifier that can be directly generated from run_esearch()
- Returns
List of dictionaries, where each dictionary is a record summary.
- Return type
list
- pdm_utils.functions.ncbi.get_verified_data_handle(acc_id_dict, ncbi_cred_dict={}, batch_size=200, file_type='gb')¶
Retrieve genomes from GenBank.
output_folder = Path to where files will be saved. acc_id_dict = Dictionary where key = Accession and value = List[PhageIDs]
- pdm_utils.functions.ncbi.run_esearch(db='', term='', usehistory='')¶
Search for valid records in NCBI.
Uses NCBI esearch implemented through BioPython Entrez.
- Parameters
db (str) – Name of the database to search.
term (str) – Search term.
usehistory (str) – Indicates if prior searches should be used.
- Returns
Results of the search for each valid record.
- Return type
dict
- pdm_utils.functions.ncbi.set_entrez_credentials(tool=None, email=None, api_key=None)¶
Set BioPython Entrez credentials to improve speed and reliability.
- Parameters
tool (str) – Name of the software/tool being used.
email (str) – Email contact information for NCBI.
api_key (str) – Unique NCBI-issued identifier to enhance retrieval speed.
pdm_utils.functions.parallelize module¶
Functions to parallelize the processing of a list of inputs. Adapted from https://docs.python.org/3/library/multiprocessing.html
- pdm_utils.functions.parallelize.count_processors(inputs, num_processors)¶
Programmatically determines whether the specified num_processors is appropriate. There’s no need to use more processors than there are inputs, and it’s impossible to use fewer than 1 processor or more than exist on the machine running the code.
- Parameters
inputs – list of inputs
num_processors – specified number of processors
- Returns
num_processors (optimized)
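The bounds described for count_processors amount to clamping the requested value. A minimal sketch, assuming os.cpu_count() as the measure of available processors (the package may query this differently) and using a hypothetical helper name:

```python
import os


def sketch_count_processors(inputs, num_processors):
    """Clamp num_processors to [1, min(available CPUs, number of inputs)].

    Hypothetical helper for illustration.
    """
    available = os.cpu_count() or 1  # cpu_count() may return None
    upper = min(available, max(len(inputs), 1))
    return max(1, min(num_processors, upper))


print(sketch_count_processors(["a", "b", "c"], 64))
```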
- pdm_utils.functions.parallelize.parallelize(inputs, num_processors, task, verbose=True)¶
Parallelizes some task on an input list across the specified number of processors.
- Parameters
inputs – list of inputs
num_processors – number of processor cores to use
task – name of the function to run
verbose – updating progress bar output?
- Returns
results
- pdm_utils.functions.parallelize.start_processes(inputs, num_processors, verbose)¶
Creates input and output queues, and runs the jobs.
- Parameters
inputs – jobs to run
num_processors – optimized number of processors
verbose – updating progress bar output?
- Returns
results
- pdm_utils.functions.parallelize.worker(input_queue, output_queue)¶
pdm_utils.functions.parsing module¶
- pdm_utils.functions.parsing.check_operator(operator, column_object)¶
Validates an operator’s application on a MySQL column.
- Parameters
operator (str) – Accepted MySQL operator.
column_object (Column) – A SQLAlchemy Column object.
- pdm_utils.functions.parsing.create_filter_key(unparsed_filter)¶
Creates a standardized filter string from a valid unparsed_filter.
- Parameters
unparsed_filter – Formatted MySQL WHERE clause.
- Returns
Standardized MySQL conditional string.
- Return type
str
- pdm_utils.functions.parsing.parse_cmd_list(unparsed_string_list)¶
Recognizes and parses MySQL WHERE clause structures from cmd lists.
- Parameters
unparsed_string_list (list[str]) – Formatted MySQL WHERE clause arguments.
- Returns
2-D array containing lists of statements joined by ORs.
- Return type
list[list]
- pdm_utils.functions.parsing.parse_cmd_string(unparsed_cmd_string)¶
Recognizes and parses MySQL WHERE clause structures.
- Parameters
unparsed_cmd_string (str) – Formatted MySQL WHERE clause string.
- Returns
2-D array containing lists of statements joined by ORs.
- Return type
list[list]
- pdm_utils.functions.parsing.parse_column(unparsed_column)¶
Recognizes and parses a MySQL structured column.
- Parameters
unparsed_column (str) – Formatted MySQL column.
- Returns
List containing segments of a MySQL column.
- Return type
list[str]
- pdm_utils.functions.parsing.parse_filter(unparsed_filter)¶
Recognizes and parses a MySQL structured WHERE clause.
- Parameters
unparsed_filter – Formatted MySQL WHERE clause.
- Returns
List containing segments of a MySQL WHERE clause.
- Return type
list[str]
- pdm_utils.functions.parsing.parse_in_spaces(unparsed_string_list)¶
Convert a list of strings to a single space separated string.
- Parameters
unparsed_string_list (list[str]) – String list to be concatenated
- Returns
String with parsed in whitespace.
- Return type
str
- pdm_utils.functions.parsing.parse_out_ends(unparsed_string)¶
Parse and remove beginning and end whitespace of a string.
- Parameters
unparsed_string (str) – String with variable terminal whitespaces.
- Returns
String with parsed and removed beginning and ending whitespace.
- Return type
str
- pdm_utils.functions.parsing.parse_out_spaces(unparsed_string)¶
Parse and remove beginning and internal white space of a string.
- Parameters
unparsed_string (str) – String with variable terminal whitespaces.
- Returns
String with parsed and removed beginning and ending whitespace.
- Return type
str
- pdm_utils.functions.parsing.translate_column(metadata, raw_column)¶
Converts a case-insensitive {table}.{column} str to a case-sensitive str.
- Parameters
metadata (MetaData) – Reflected SQLAlchemy MetaData object.
raw_column (str) – Case-insensitive {table}.{column}.
- Returns
Case-sensitive column name.
- Return type
str
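The case-insensitive lookup can be sketched with a plain dictionary standing in for the reflected SQLAlchemy MetaData. The helper name, the schema dictionary, and the choice to raise ValueError on a miss are all assumptions made for illustration.

```python
def sketch_translate_column(schema, raw_column):
    """Resolve a case-insensitive 'table.column' string to the stored column name.

    Hypothetical helper; schema is a dict of table name -> list of column
    names, standing in for a reflected MetaData object.
    """
    raw_table, raw_name = raw_column.split(".", 1)
    tables = {table.lower(): table for table in schema}
    table = tables.get(raw_table.lower())
    if table is None:
        raise ValueError(f"Table '{raw_table}' not found.")
    columns = {column.lower(): column for column in schema[table]}
    column = columns.get(raw_name.lower())
    if column is None:
        raise ValueError(f"Column '{raw_name}' not found in table '{table}'.")
    return column


schema = {"phage": ["PhageID", "HostGenus"], "gene": ["GeneID", "PhamID"]}
print(sketch_translate_column(schema, "PHAGE.hostgenus"))
```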
- pdm_utils.functions.parsing.translate_table(metadata, raw_table)¶
Converts a case-insensitive table name to a case-sensitive str.
- Parameters
metadata (MetaData) – Reflected SQLAlchemy MetaData object.
raw_table (str) – Case-insensitive table name.
- Returns
Case-sensitive table name.
- Return type
str
pdm_utils.functions.phagesdb module¶
Functions to interact with PhagesDB.
- pdm_utils.functions.phagesdb.construct_phage_url(phage_name)¶
Create URL to retrieve phage-specific data from PhagesDB.
- Parameters
phage_name (str) – Name of the phage of interest.
- Returns
URL pertaining to the phage.
- Return type
str
- pdm_utils.functions.phagesdb.create_cluster_subcluster_sets(url='https://phagesdb.org/api/clusters/')¶
Create sets of clusters and subclusters currently in PhagesDB.
- Parameters
url (str) – A URL from which to retrieve cluster and subcluster data.
- Returns
tuple (cluster_set, subcluster_set) WHERE cluster_set(set) is a set of all unique clusters on PhagesDB. subcluster_set(set) is a set of all unique subclusters on PhagesDB.
- Return type
tuple
- pdm_utils.functions.phagesdb.create_host_genus_set(url='https://phagesdb.org/api/host_genera/')¶
Create a set of host genera currently in PhagesDB.
- Parameters
url (str) – A URL from which to retrieve host genus data.
- Returns
All unique host genera listed on PhagesDB.
- Return type
set
- pdm_utils.functions.phagesdb.get_genome(phage_id, gnm_type='', seq=False)¶
Get genome data from PhagesDB.
- Parameters
phage_id (str) – The name of the phage to be retrieved from PhagesDB.
gnm_type (str) – Identifier for the type of genome.
seq (bool) – Indicates whether the genome sequence should be retrieved.
- Returns
A pdm_utils Genome object with the parsed data. If no genome is retrieved, None is returned.
- Return type
Genome
- pdm_utils.functions.phagesdb.get_phagesdb_data(url)¶
Retrieve all sequenced genome data from PhagesDB.
- Parameters
url (str) – URL to connect to PhagesDB API.
- Returns
List of dictionaries, where each dictionary contains data for each phage. If a problem is encountered during retrieval, an empty list is returned.
- Return type
list
- pdm_utils.functions.phagesdb.get_unphamerated_phage_list(url)¶
Retrieve list of unphamerated phages from PhagesDB.
- Parameters
url (str) – A URL from which to retrieve a list of PhagesDB genomes that are not in the most up-to-date instance of the Actino_Draft MySQL database.
- Returns
List of PhageIDs.
- Return type
list
- pdm_utils.functions.phagesdb.parse_accession(data_dict)¶
Retrieve Accession from PhagesDB.
- Parameters
data_dict (dict) – Dictionary of data retrieved from PhagesDB.
- Returns
Accession of the phage.
- Return type
str
- pdm_utils.functions.phagesdb.parse_cluster(data_dict)¶
Retrieve Cluster from PhagesDB.
If the phage is clustered, ‘pcluster’ is a dictionary, and one key is the Cluster data (Cluster or ‘Singleton’). If for some reason no Cluster info is added at the time the genome is added to PhagesDB, ‘pcluster’ may automatically be set to NULL, which gets converted to “Unclustered” during retrieval. In the MySQL database NULL means Singleton, and the long form “Unclustered” is invalid due to its character length, so this value is converted to ‘UNK’ (‘Unknown’).
- Parameters
data_dict (dict) – Dictionary of data retrieved from PhagesDB.
- Returns
Cluster of the phage.
- Return type
str
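The Cluster rules above can be sketched as follows. The inner key name "cluster" is an assumption (the description only says that one key of ‘pcluster’ holds the Cluster data), and the helper name is hypothetical.

```python
def sketch_parse_cluster(data_dict):
    """Return the Cluster, or 'UNK' when PhagesDB holds no Cluster info.

    Hypothetical helper; the "cluster" key name is an assumption.
    """
    pcluster = data_dict.get("pcluster")
    if pcluster is None:
        # Missing Cluster info ("Unclustered") is too long for the MySQL
        # field and is not the same as Singleton, so 'UNK' is used.
        return "UNK"
    return pcluster["cluster"]


print(sketch_parse_cluster({"pcluster": {"cluster": "A"}}))
print(sketch_parse_cluster({"pcluster": None}))
```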
- pdm_utils.functions.phagesdb.parse_fasta_data(fasta_data)¶
Parses data returned from a fasta-formatted file.
- Parameters
fasta_data (str) – Data from a fasta file.
- Returns
tuple (header, sequence) WHERE header(str) is the first line parsed from the parsed file. sequence(str) is the nucleotide sequence parsed from the file.
- Return type
tuple
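Splitting fasta text into a header and a sequence can be sketched with string methods. Stripping the leading ‘>’ from the header is an assumption; the description only says the header is the first line. The helper name is hypothetical.

```python
def sketch_parse_fasta_data(fasta_data):
    """Split fasta-formatted text into (header, sequence).

    Hypothetical helper; assumes a single record.
    """
    lines = fasta_data.strip().splitlines()
    if not lines:
        return "", ""
    header = lines[0].lstrip(">").strip()
    # Sequence lines are joined so that line wrapping is removed.
    sequence = "".join(line.strip() for line in lines[1:])
    return header, sequence


header, sequence = sketch_parse_fasta_data(">Trixie\nACGT\nTTGA\n")
print(header, sequence)
```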
- pdm_utils.functions.phagesdb.parse_fasta_filename(data_dict)¶
Retrieve fasta filename from PhagesDB.
- Parameters
data_dict (dict) – Dictionary of data retrieved from PhagesDB.
- Returns
Name of the fasta file for the phage.
- Return type
str
- pdm_utils.functions.phagesdb.parse_genome_data(data_dict, gnm_type='', seq=False)¶
Parses a dictionary of PhagesDB genome data into a pdm_utils Genome object.
- Parameters
data_dict (dict) – Dictionary of data retrieved from PhagesDB.
gnm_type (str) – Identifier for the type of genome.
seq (bool) – Indicates whether the genome sequence should be retrieved.
- Returns
A pdm_utils Genome object with the parsed data.
- Return type
Genome
- pdm_utils.functions.phagesdb.parse_genomes_dict(data_dict, gnm_type='', seq=False)¶
Returns a dictionary of pdm_utils Genome objects
- Parameters
data_dict (dict) – Dictionary of dictionaries. Key = PhageID. Value = Dictionary of genome data retrieved from PhagesDB.
gnm_type (str) – Identifier for the type of genome.
seq (bool) – Indicates whether the genome sequence should be retrieved.
- Returns
Dictionary of pdm_utils Genome objects. Key = PhageID. Value = Genome object.
- Return type
dict
- pdm_utils.functions.phagesdb.parse_host_genus(data_dict)¶
Retrieve host_genus from PhagesDB.
- Parameters
data_dict (dict) – Dictionary of data retrieved from PhagesDB.
- Returns
Host genus of the phage.
- Return type
str
- pdm_utils.functions.phagesdb.parse_phage_name(data_dict)¶
Retrieve Phage Name from PhagesDB.
- Parameters
data_dict (dict) – Dictionary of data retrieved from PhagesDB.
- Returns
Name of the phage.
- Return type
str
- pdm_utils.functions.phagesdb.parse_subcluster(data_dict)¶
Retrieve Subcluster from PhagesDB.
If for some reason no cluster info is added at the time the genome is added to PhagesDB, ‘psubcluster’ may automatically be set to NULL, which gets returned as None. If the phage is a Singleton, ‘psubcluster’ is None. If the phage is clustered but not subclustered, ‘psubcluster’ is None. If the phage is clustered and subclustered, ‘psubcluster’ is a dictionary, and one key is the Subcluster data.
- Parameters
data_dict (dict) – Dictionary of data retrieved from PhagesDB.
- Returns
Subcluster of the phage.
- Return type
str
- pdm_utils.functions.phagesdb.retrieve_data_list(url)¶
Retrieve list of data from PhagesDB.
- Parameters
url (str) – A URL from which to retrieve data.
- Returns
A list of data retrieved from the URL.
- Return type
list
- pdm_utils.functions.phagesdb.retrieve_genome_data(phage_url)¶
Retrieve all data from PhagesDB for a specific phage.
- Parameters
phage_url (str) – URL for data pertaining to a specific phage.
- Returns
Dictionary of data parsed from the URL.
- Return type
dict
- pdm_utils.functions.phagesdb.retrieve_url_data(url)¶
Retrieve fasta file from PhagesDB.
- Parameters
url (str) – URL for data to be retrieved.
- Returns
Data from the URL.
- Return type
str
pdm_utils.functions.phameration module¶
Functions that are used in the phameration pipeline
- pdm_utils.functions.phameration.blastp(index, chunk, tmp, db_path, evalue, query_cov)¶
Runs ‘blastp’ using the given chunk as the input gene set. The blast output is an adjacency matrix for this chunk.
- Parameters
index (int) – chunk index being run
chunk (tuple of 2-tuples) – the translations to run right now
tmp (str) – path where I/O can go on
db_path (str) – path to the target blast database
evalue (float) – e-value cutoff to report hits
- pdm_utils.functions.phameration.chunk_translations(translation_groups, chunksize=500)¶
Break translation_groups into a dictionary of chunksize-tuples of 2-tuples where each 2-tuple is a translation and its corresponding geneid.
- Parameters
translation_groups (dict) – translations and their geneids
chunksize (int) – how many translations will be in a chunk?
- Returns
chunks
- Return type
dict
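The chunking can be sketched as below. Two details are assumptions: that translation_groups maps each translation to a list of geneids, and that the first geneid of each group is the one paired with the translation. The helper name is hypothetical.

```python
def sketch_chunk_translations(translation_groups, chunksize=500):
    """Split (translation, geneid) pairs into numbered chunks.

    Hypothetical helper; pairs each translation with the first geneid
    of its group, which is an assumption.
    """
    pairs = [(translation, geneids[0])
             for translation, geneids in translation_groups.items()]
    chunks = {}
    for start in range(0, len(pairs), chunksize):
        # Each chunk is a tuple of at most chunksize 2-tuples.
        chunks[start // chunksize] = tuple(pairs[start:start + chunksize])
    return chunks


groups = {"MKT": ["g1", "g2"], "MSA": ["g3"], "MLL": ["g4"]}
chunks = sketch_chunk_translations(groups, chunksize=2)
print(chunks)
```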
- pdm_utils.functions.phameration.create_blastdb(fasta, db_name, db_path)¶
Runs ‘makeblastdb’ to create a BLAST-searchable database.
- Parameters
fasta (str) – FASTA-formatted input file
db_name (str) – BLAST sequence database
db_path (str) – BLAST sequence database path
- pdm_utils.functions.phameration.fix_colored_orphams(engine)¶
Find any single-member phams which are colored as though they are multi-member phams (not #FFFFFF in pham.Color).
- Parameters
engine – sqlalchemy Engine allowing access to the database
- pdm_utils.functions.phameration.fix_white_phams(engine)¶
Find any phams with 2+ members which are colored as though they are orphams (#FFFFFF in pham.Color).
- Parameters
engine – sqlalchemy Engine allowing access to the database
- pdm_utils.functions.phameration.get_geneids_and_translations(engine)¶
Constructs a dictionary mapping all geneids to their translations.
- Parameters
engine – the Engine allowing access to the database
- Returns
gs_to_ts
- pdm_utils.functions.phameration.get_new_geneids(engine)¶
Queries the database for those genes that are not yet phamerated.
- Parameters
engine – the Engine allowing access to the database
- Returns
new_geneids
- pdm_utils.functions.phameration.get_pham_colors(engine)¶
Queries the database for the colors of existing phams.
- Parameters
engine – the Engine allowing access to the database
- Returns
pham_colors
- pdm_utils.functions.phameration.get_pham_geneids(engine)¶
Queries the database for those genes that are already phamerated.
- Parameters
engine – the Engine allowing access to the database
- Returns
pham_geneids
- pdm_utils.functions.phameration.get_translation_groups(engine)¶
Constructs a dictionary mapping all unique translations to the groups of geneids that share them.
- Parameters
engine – the Engine allowing access to the database
- Returns
ts_to_gs
- pdm_utils.functions.phameration.markov_cluster(adj_mat_file, inflation, tmp_dir)¶
Run ‘mcl’ on an adjacency matrix to cluster the blastp results.
- Parameters
adj_mat_file (str) – 3-column file with blastp resultant queries, subjects, and evalues
inflation (float) – mcl inflation parameter
tmp_dir (str) – file I/O directory
- Returns
outfile
- Return type
str
- pdm_utils.functions.phameration.merge_pre_and_hmm_phams(hmm_phams, pre_phams, consensus_lookup)¶
Merges the pre-pham sequences (which contain all nr sequences) with the hmm phams (which contain only hmm consensus sequences) into the full hmm-based clustering output. Uses consensus_lookup dictionary to find the pre-pham that each consensus belongs to, and then adds each pre-pham geneid to a full pham based on the hmm phams.
- Parameters
hmm_phams (dict) – clustered consensus sequences
pre_phams (dict) – clustered sequences (used to generate hmms)
consensus_lookup (dict) – reverse-mapped pre_phams
- Returns
phams
- Return type
dict
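The merge step described above can be sketched as follows. This is an illustration under assumptions, not the library's code: `merge_phams_sketch` is a hypothetical name, hmm_phams is assumed to map a pham id to consensus ids, pre_phams to map a pre-pham id to geneids, and consensus_lookup to map each consensus id to its pre-pham id.

```python
def merge_phams_sketch(hmm_phams, pre_phams, consensus_lookup):
    """Expand each hmm pham into the geneids of its underlying pre-phams."""
    phams = {}
    for pham_id, consensus_ids in hmm_phams.items():
        merged = set()
        for consensus_id in consensus_ids:
            # Find the pre-pham this consensus was built from (assumed layout).
            pre_pham_id = consensus_lookup[consensus_id]
            merged.update(pre_phams[pre_pham_id])
        phams[pham_id] = merged
    return phams


pre_phams = {1: {"gene_1", "gene_2"}, 2: {"gene_3"}, 3: {"gene_4"}}
consensus_lookup = {"cons_a": 1, "cons_b": 2, "cons_c": 3}
hmm_phams = {10: ["cons_a", "cons_b"], 11: ["cons_c"]}
merged_phams = merge_phams_sketch(hmm_phams, pre_phams, consensus_lookup)
```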
- pdm_utils.functions.phameration.mmseqs_clust(consensus_db, align_db, cluster_db)¶
Runs ‘mmseqs clust’ to cluster an MMseqs2 consensus database using an MMseqs2 alignment database, with results being saved to an MMseqs2 cluster database.
- Parameters
consensus_db (str) – MMseqs2 sequence database
align_db (str) – MMseqs2 alignment database
cluster_db (str) – MMseqs2 cluster database
- pdm_utils.functions.phameration.mmseqs_cluster(sequence_db, cluster_db, args)¶
Runs ‘mmseqs cluster’ to cluster an MMseqs2 sequence database.
- Parameters
sequence_db (str) – MMseqs2 sequence database
cluster_db (str) – MMseqs2 clustered database
args (dict) – parsed command line arguments
- pdm_utils.functions.phameration.mmseqs_createdb(fasta, sequence_db)¶
Runs ‘mmseqs createdb’ to convert a FASTA file into an MMseqs2 sequence database.
- Parameters
fasta (str) – path to the FASTA file to convert
sequence_db (str) – MMseqs2 sequence database
- pdm_utils.functions.phameration.mmseqs_createseqfiledb(sequence_db, cluster_db, seqfile_db)¶
Runs ‘mmseqs createseqfiledb’ to create the intermediate to the FASTA-like parseable output.
- Parameters
sequence_db (str) – MMseqs2 sequence database
cluster_db (str) – MMseqs2 clustered database
seqfile_db (str) – MMseqs2 seqfile database
- pdm_utils.functions.phameration.mmseqs_profile2consensus(profile_db, consensus_db)¶
Runs ‘mmseqs profile2consensus’ to extract consensus sequences from an MMseqs2 profile database, and creates an MMseqs2 sequence database from the consensuses.
- Parameters
profile_db (str) – MMseqs2 profile database
consensus_db (str) – MMseqs2 sequence database
- pdm_utils.functions.phameration.mmseqs_result2flat(query_db, target_db, seqfile_db, outfile)¶
Runs ‘mmseqs result2flat’ to create FASTA-like parseable output.
- Parameters
query_db (str) – MMseqs2 sequence or profile database
target_db (str) – MMseqs2 sequence database
seqfile_db (str) – MMseqs2 seqfile database
outfile (str) – FASTA-like parseable output
- pdm_utils.functions.phameration.mmseqs_result2profile(sequence_db, cluster_db, profile_db)¶
Runs ‘mmseqs result2profile’ to convert clusters from one MMseqs2 clustered database into a profile database.
- Parameters
sequence_db (str) – MMseqs2 sequence database
cluster_db (str) – MMseqs2 clustered database
profile_db (str) – MMseqs2 profile database
- pdm_utils.functions.phameration.mmseqs_search(profile_db, consensus_db, align_db, args)¶
Runs ‘mmseqs search’ to search profiles against their consensus sequences and save the alignment results to an MMseqs2 alignment database. The profile_db and consensus_db MUST be the same size.
- Parameters
profile_db (str) – MMseqs2 profile database
consensus_db (str) – MMseqs2 sequence database
align_db (str) – MMseqs2 alignment database
args (dict) – parsed command line arguments
- pdm_utils.functions.phameration.parse_mcl_output(outfile)¶
Parse the mci output into phams.
- Parameters
outfile (str) – mci output file
- Returns
phams
- Return type
dict
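As an illustration of parsing clustering output into phams, the sketch below assumes the file uses mcl's label output layout (one cluster per line, members separated by tabs) and that phams are numbered from 1. Both points, and the name `parse_mcl_output_sketch`, are assumptions rather than documented behavior of this function.

```python
import os
import tempfile


def parse_mcl_output_sketch(outfile):
    """Read one cluster per line and number the clusters from 1."""
    phams = {}
    with open(outfile) as handle:
        for line in handle:
            members = line.split()
            if not members:
                continue  # skip blank lines
            # Assumption: pham keys are consecutive integers starting at 1.
            phams[len(phams) + 1] = set(members)
    return phams


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "mcl_output.txt")
    with open(path, "w") as handle:
        handle.write("gene_1\tgene_2\tgene_3\n\ngene_4\n")
    mcl_phams = parse_mcl_output_sketch(path)
```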
- pdm_utils.functions.phameration.parse_mmseqs_output(outfile)¶
Parses the indicated MMseqs2 FASTA-like file into a dictionary of integer-named phams.
- Parameters
outfile (str) – FASTA-like parseable output
- Returns
phams
- Return type
dict
- pdm_utils.functions.phameration.preserve_phams(old_phams, new_phams, old_colors, new_genes)¶
Attempts to keep pham numbers consistent from one round of pham building to the next.
- Parameters
old_phams – the dictionary that maps old phams to their genes
new_phams – the dictionary that maps new phams to their genes
old_colors – the dictionary that maps old phams to colors
new_genes – the set of previously unphamerated genes
- pdm_utils.functions.phameration.reintroduce_duplicates(new_phams, trans_groups, genes_and_trans)¶
Reintroduces into each pham ALL GeneIDs that map onto the set of translations in the pham.
- Parameters
new_phams – the pham dictionary for which duplicates are to be reintroduced
trans_groups – the dictionary that maps translations to the GeneIDs that share them
genes_and_trans – the dictionary that maps GeneIDs to their translations
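The reintroduction step above can be sketched with plain dictionaries. The name `reintroduce_duplicates_sketch` is hypothetical, and returning a new dictionary (rather than modifying new_phams in place) is an assumption made for the example.

```python
def reintroduce_duplicates_sketch(new_phams, trans_groups, genes_and_trans):
    """Expand each pham to every GeneID sharing one of its translations."""
    full_phams = {}
    for pham_id, geneids in new_phams.items():
        full = set()
        for geneid in geneids:
            translation = genes_and_trans[geneid]
            # Add all GeneIDs that share this translation, duplicates included.
            full.update(trans_groups[translation])
        full_phams[pham_id] = full
    return full_phams


trans_groups = {"MKTAYI": ["gene_1", "gene_7"], "MASLLV": ["gene_2"]}
genes_and_trans = {"gene_1": "MKTAYI", "gene_7": "MKTAYI", "gene_2": "MASLLV"}
nr_phams = {1: {"gene_1", "gene_2"}}
full_phams = reintroduce_duplicates_sketch(nr_phams, trans_groups, genes_and_trans)
```

Here gene_7 is restored to pham 1 because it shares a translation with gene_1.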
- pdm_utils.functions.phameration.update_gene_table(phams, engine)¶
Updates the gene table with new pham data.
- Parameters
phams (dict) – new pham gene data
engine – sqlalchemy Engine allowing access to the database
- pdm_utils.functions.phameration.update_pham_table(colors, engine)¶
Populates the pham table with the new PhamIDs and their colors.
- Parameters
colors (dict) – new pham color data
engine – sqlalchemy Engine allowing access to the database
- pdm_utils.functions.phameration.write_fasta(translation_groups, outfile)¶
Writes a FASTA file of the non-redundant protein sequences to be assorted into phamilies.
- Parameters
translation_groups (dict) – groups of genes that share a translation
outfile (str) – FASTA filename
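A minimal sketch of writing non-redundant sequences to FASTA is shown below. It assumes translation_groups maps each translation to a list of geneids and that the first geneid of each group is used as the record header. The header choice and the name `write_fasta_sketch` are assumptions.

```python
import os
import tempfile


def write_fasta_sketch(translation_groups, outfile):
    """Write one FASTA record per unique translation."""
    with open(outfile, "w") as handle:
        for translation, geneids in translation_groups.items():
            # Assumption: the first geneid in the group labels the record.
            handle.write(f">{geneids[0]}\n{translation}\n")


groups = {"MKTAYI": ["gene_1", "gene_7"], "MASLLV": ["gene_2"]}
with tempfile.TemporaryDirectory() as tmp_dir:
    fasta_path = os.path.join(tmp_dir, "nr.fasta")
    write_fasta_sketch(groups, fasta_path)
    with open(fasta_path) as handle:
        fasta_text = handle.read()
```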
pdm_utils.functions.pipelines_basic module¶
- pdm_utils.functions.pipelines_basic.add_sort_columns(db_filter, sort_columns, verbose=False)¶
- pdm_utils.functions.pipelines_basic.build_alchemist(database, ask_database=True, config=None, dialect='mysql')¶
- pdm_utils.functions.pipelines_basic.build_filter(alchemist, key, filters, values=None, verbose=False)¶
Applies MySQL WHERE clause filters using a Filter.
- Parameters
alchemist (AlchemyHandler) – A connected and fully built AlchemyHandler object.
table (str) – MySQL table name.
filters (list[list[str]]) – A list of lists with filter values, grouped by ORs.
groups (list[str]) – A list of supported MySQL column names.
- Returns
Filter object loaded with the given filters.
- Return type
- pdm_utils.functions.pipelines_basic.build_groups_map(db_filter, export_path, groups=[], verbose=False, force=False, dump=False)¶
Function that generates a map between conditionals and grouping paths.
- Parameters
db_filter (Filter) – A connected and fully loaded Filter object.
export_path – Path to a dir for new dir creation.
groups (list[str]) – A list of supported MySQL column names.
conditionals_map (dict{Path:list}) – A mapping between group conditionals and Paths.
verbose (bool) – A boolean value to toggle progress print statements.
previous (str) – Value set by function to provide info for print statements
depth (int) – Value set by function to provide info for print statements.
- Returns conditionals_map
A mapping between group conditionals and Paths.
- Return type
dict{Path:list}
- pdm_utils.functions.pipelines_basic.build_groups_tree(db_filter, export_path, conditionals_map, groups=[], verbose=False, force=False, previous=None, depth=0)¶
Recursive function that generates directories based on groupings.
- Parameters
db_filter (Filter) – A connected and fully loaded Filter object.
export_path – Path to a dir for new dir creation.
groups (list[str]) – A list of supported MySQL column names.
conditionals_map (dict{Path:list}) – A mapping between group conditionals and Paths.
verbose (bool) – A boolean value to toggle progress print statements.
previous (str) – Value set by function to provide info for print statements
depth (int) – Value set by function to provide info for print statements.
- Returns conditionals_map
A mapping between group conditionals and Paths.
- Return type
dict{Path:list}
- pdm_utils.functions.pipelines_basic.convert_dir_path(path)¶
Function to convert argparse input to a working directory path.
- Parameters
path (str) – A string to be converted into a Path object.
- Returns
A Path object converted from the input string.
- Return type
Path
- pdm_utils.functions.pipelines_basic.convert_file_path(path)¶
Function to convert argparse input to a working file path.
- Parameters
path (str) – A string to be converted into a Path object.
- Returns
A Path object converted from the input string.
- Return type
Path
- pdm_utils.functions.pipelines_basic.create_default_path(name, force=False, attempt=50)¶
- pdm_utils.functions.pipelines_basic.create_working_dir(working_path, dump=False, force=False)¶
- pdm_utils.functions.pipelines_basic.create_working_path(folder_path, folder_name, dump=False, force=False, attempt=50)¶
- pdm_utils.functions.pipelines_basic.parse_value_input(value_list_input)¶
- pdm_utils.functions.pipelines_basic.parse_value_input(value_list_input: pathlib.Path)
- pdm_utils.functions.pipelines_basic.parse_value_input(value_list_input: list)
Function to convert input values to recognized data types.
- Parameters
value_list_input (Path) – Values stored in recognized data types.
- Returns
List of values to filter database results.
- Return type
list[str]
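The repeated signatures above (one for pathlib.Path, one for list) suggest type-based dispatch. A minimal sketch of that pattern with `functools.singledispatch` follows. The name `parse_values_sketch` is hypothetical, and reading one value per line from the file is an assumption about the file layout, not documented behavior.

```python
import tempfile
from functools import singledispatch
from pathlib import Path


@singledispatch
def parse_values_sketch(value_list_input):
    """Fallback for input types with no registered handler."""
    raise TypeError(f"Unsupported input type: {type(value_list_input).__name__}")


@parse_values_sketch.register
def _(value_list_input: Path):
    # Assumption: the file holds one value per line.
    with value_list_input.open() as handle:
        return [line.strip() for line in handle if line.strip()]


@parse_values_sketch.register
def _(value_list_input: list):
    # A list is passed through, with each value converted to a string.
    return [str(value) for value in value_list_input]


with tempfile.TemporaryDirectory() as tmp_dir:
    values_file = Path(tmp_dir) / "values.txt"
    values_file.write_text("Trixie\nD29\n")
    from_file = parse_values_sketch(values_file)

from_list = parse_values_sketch(["Trixie", "D29"])
```

Dispatching on the argument type keeps the file-reading and list-handling paths separate while exposing a single function name to callers.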
pdm_utils.functions.querying module¶
- pdm_utils.functions.querying.append_group_by_clauses(executable, group_by_clauses)¶
Add GROUP BY SQLAlchemy Column objects to a Select object.
- Parameters
executable (Select) – SQLAlchemy executable query object.
group_by_clauses (list) – MySQL GROUP BY clause-related SQLAlchemy object(s).
- Returns
MySQL expression-related SQLAlchemy executable.
- Return type
Select
- pdm_utils.functions.querying.append_having_clauses(executable, having_clauses)¶
Add HAVING SQLAlchemy Column objects to a Select object.
- Parameters
executable (Select) – SQLAlchemy executable query object.
having_clauses – MySQL HAVING clause-related SQLAlchemy object(s).
- Returns
MySQL expression-related SQLAlchemy executable.
- Return type
Select
- pdm_utils.functions.querying.append_order_by_clauses(executable, order_by_clauses)¶
Add ORDER BY SQLAlchemy Column objects to a Select object.
- Parameters
executable (Select) – SQLAlchemy executable query object.
order_by_clauses (list) – MySQL ORDER BY clause-related SQLAlchemy object(s)
- Returns
MySQL expression-related SQLAlchemy executable.
- Return type
Select
- pdm_utils.functions.querying.append_where_clauses(executable, where_clauses)¶
Add WHERE SQLAlchemy BinaryExpression objects to a Select object.
- Parameters
executable (Select) – SQLAlchemy executable query object.
where_clauses (list) – MySQL WHERE clause-related SQLAlchemy object(s).
- Returns
MySQL expression-related SQLAlchemy executable.
- Return type
Select
- pdm_utils.functions.querying.build_count(db_graph, columns, where=None, add_in=None)¶
Get MySQL COUNT() expression SQLAlchemy executable.
- Parameters
db_graph (Graph) – SQLAlchemy structured NetworkX Graph object.
columns (list) – SQLAlchemy Column object(s).
where (list) – MySQL WHERE clause-related SQLAlchemy object(s).
add_in (list) – MySQL Column-related inputs to be considered for joining.
- Returns
MySQL COUNT() expression-related SQLAlchemy executable.
- Return type
Select
- pdm_utils.functions.querying.build_distinct(db_graph, columns, where=None, order_by=None, add_in=None)¶
Get MySQL DISTINCT expression SQLAlchemy executable.
- Parameters
db_graph (Graph) – SQLAlchemy structured NetworkX Graph object.
columns (list) – SQLAlchemy Column object(s).
where (list) – MySQL WHERE clause-related SQLAlchemy object(s).
order_by (list) – MySQL ORDER BY clause-related SQLAlchemy object(s).
add_in (list) – MySQL Column-related inputs to be considered for joining.
- Returns
MySQL DISTINCT expression-related SQLAlchemy executable.
- Return type
Select
- pdm_utils.functions.querying.build_fromclause(db_graph, columns)¶
Get a joined table from pathing instructions for joining MySQL Tables.
- Parameters
db_graph (Graph) – SQLAlchemy structured NetworkX Graph object.
columns (list) – SQLAlchemy Column object(s).
- Returns
SQLAlchemy Table object containing left outer-joined tables.
- Return type
Table
- pdm_utils.functions.querying.build_graph(metadata)¶
Get a NetworkX Graph object populated from a SQLAlchemy MetaData object.
- Parameters
metadata (MetaData) – Reflected SQLAlchemy MetaData object.
- Returns
Populated and structured NetworkX Graph object.
- Return type
Graph
- pdm_utils.functions.querying.build_onclause(db_graph, source_table, adjacent_table)¶
Creates a SQLAlchemy BinaryExpression object for a MySQL ON clause expression.
- Parameters
db_graph (Graph) – SQLAlchemy structured NetworkX Graph object.
source_table (str) – Case-insensitive MySQL table name.
adjacent_table (str) – Case-insensitive MySQL table name.
- Returns
MySQL foreign key related SQLAlchemy BinaryExpression object.
- Return type
BinaryExpression
- pdm_utils.functions.querying.build_select(db_graph, columns, where=None, order_by=None, add_in=None, having=None, group_by=None)¶
Get MySQL SELECT expression SQLAlchemy executable.
- Parameters
db_graph (Graph) – SQLAlchemy structured NetworkX Graph object.
columns (list) – SQLAlchemy Column object(s).
where (list) – MySQL WHERE clause-related SQLAlchemy object(s).
order_by (list) – MySQL ORDER BY clause-related SQLAlchemy object(s).
add_in (list) – MySQL Column-related inputs to be considered for joining.
having (list) – MySQL HAVING clause-related SQLAlchemy object(s).
group_by (list) – MySQL GROUP BY clause-related SQLAlchemy object(s).
- Returns
MySQL SELECT expression-related SQLAlchemy executable.
- Return type
Select
- pdm_utils.functions.querying.build_where_clause(db_graph, filter_expression)¶
Creates a SQLAlchemy BinaryExpression object from a MySQL WHERE clause expression.
- Parameters
db_graph (Graph) – SQLAlchemy structured NetworkX Graph object.
filter_expression (str) – MySQL where clause expression.
- Returns
MySQL expression-related SQLAlchemy BinaryExpression object.
- Return type
BinaryExpression
- pdm_utils.functions.querying.execute(engine, executable, in_column=None, values=[], limit=8000, return_dict=True)¶
Use SQLAlchemy Engine to execute a MySQL query.
- Parameters
engine (Engine) – SQLAlchemy Engine object used for executing queries.
executable (str) – Input an executable MySQL query.
return_dict (Boolean) – Toggle whether execute returns dict or tuple.
- Returns
Results from execution of given MySQL query.
- Return type
list[dict] if return_dict is True, otherwise list[tuple]
- pdm_utils.functions.querying.execute_value_subqueries(engine, executable, in_column, source_values, return_dict=True, limit=8000)¶
Query with a conditional on a set of values using subqueries.
- Parameters
engine (Engine) – SQLAlchemy Engine object used for executing queries.
executable (str) – Input an executable MySQL query.
in_column (Column) – SQLAlchemy Column object.
source_values (list[str]) – Values from specified MySQL column.
return_dict (Boolean) – Toggle whether to return data as a dictionary.
limit (int) – SQLAlchemy IN clause query length limiter.
- Returns
List of grouped data for each value constraint.
- Return type
list
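The idea of splitting a long list of values into several smaller IN-clause queries can be sketched with the standard library's sqlite3 module. This is not the SQLAlchemy-based implementation documented here: `execute_in_chunks` and the toy `phage` table are hypothetical, and treating `limit` as a maximum number of values per query is an assumption.

```python
import sqlite3


def execute_in_chunks(conn, base_query, column, values, limit=8000):
    """Run base_query once per chunk of values and concatenate the rows."""
    rows = []
    for start in range(0, len(values), limit):
        chunk = values[start:start + limit]
        # One placeholder per value keeps the values parameterized.
        placeholders = ", ".join("?" for _ in chunk)
        # The column name is interpolated, so it must come from trusted code.
        query = f"{base_query} WHERE {column} IN ({placeholders})"
        rows.extend(conn.execute(query, chunk).fetchall())
    return rows


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE phage (PhageID TEXT, Cluster TEXT)")
conn.executemany(
    "INSERT INTO phage VALUES (?, ?)",
    [("Trixie", "A"), ("D29", "A"), ("Alice", "C")],
)
results = execute_in_chunks(
    conn,
    "SELECT PhageID, Cluster FROM phage",
    "PhageID",
    ["Trixie", "Alice", "D29"],
    limit=2,
)
conn.close()
```

With a limit of 2, the three values are queried in two batches and the rows are combined into one list.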
- pdm_utils.functions.querying.extract_column(column, check=None)¶
Get a column from a supported SQLAlchemy Column-related object.
- Parameters
column (UnaryExpression) – SQLAlchemy Column-related object.
check (<type BinaryExpression>) – SQLAlchemy Column-related object type.
- Returns
Corresponding SQLAlchemy Column object.
- Return type
Column
- pdm_utils.functions.querying.extract_columns(columns, check=None)¶
Get columns from supported SQLAlchemy Column-related object(s).
- Parameters
columns (UnaryExpression) – SQLAlchemy Column-related object(s).
check (<type BinaryExpression>) – SQLAlchemy Column-related object type.
- Returns
List of SQLAlchemy Column objects.
- Return type
list[Column]
- pdm_utils.functions.querying.first_column(engine, executable, in_column=None, values=[], limit=8000)¶
Use SQLAlchemy Engine to execute and return the first column of fields.
- Parameters
engine (Engine) – SQLAlchemy Engine object used for executing queries.
executable (str) – Input an executable MySQL query.
- Returns
A column for a set of MySQL values.
- Return type
list[str]
- pdm_utils.functions.querying.first_column_value_subqueries(engine, executable, in_column, source_values, limit=8000)¶
Query with a conditional on a set of values using subqueries.
- Parameters
engine (Engine) – SQLAlchemy Engine object used for executing queries.
executable (str) – Input an executable MySQL query.
in_column (Column) – SQLAlchemy Column object.
source_values (list[str]) – Values from specified MySQL column.
return_dict (Boolean) – Toggle whether to return data as a dictionary.
limit (int) – SQLAlchemy IN clause query length limiter.
- Returns
Distinct values fetched from value constraints.
- Return type
list
- pdm_utils.functions.querying.get_column(metadata, column)¶
Get a SQLAlchemy Column object, with a case-insensitive input. Input must be formatted {Table_name}.{Column_name}.
- Parameters
metadata (MetaData) – Reflected SQLAlchemy MetaData object.
column (str) – Case-insensitive column name.
- Returns
Corresponding SQLAlchemy Column object.
- Return type
Column
- pdm_utils.functions.querying.get_table(metadata, table)¶
Get a SQLAlchemy Table object, with a case-insensitive input.
- Parameters
metadata (MetaData) – Reflected SQLAlchemy MetaData object.
table (str) – Case-insensitive table name.
- Returns
Corresponding SQLAlchemy Table object.
- Return type
Table
- pdm_utils.functions.querying.get_table_list(columns)¶
Get a nonrepeating list of SQLAlchemy Table objects from Column objects.
- Parameters
columns (list) – SQLAlchemy Column object(s).
- Returns
List of corresponding SQLAlchemy Table objects.
- Return type
list
- pdm_utils.functions.querying.get_table_pathing(db_graph, table_list, center_table=None)¶
Get pathing instructions for joining MySQL Table objects.
- Parameters
db_graph (Graph) – SQLAlchemy structured NetworkX Graph object.
table_list (list[Table]) – List of SQLAlchemy Table objects.
center_table (Table) – SQLAlchemy Table object to begin traversals from.
- Returns
2-D list containing the center table and pathing instructions.
- Return type
list
- pdm_utils.functions.querying.join_pathed_tables(db_graph, table_pathing)¶
Get a joined table from pathing instructions for joining MySQL Tables.
- Parameters
db_graph (Graph) – SQLAlchemy structured NetworkX Graph object.
table_pathing (list) – 2-D list containing a Table and pathing lists.
- Returns
SQLAlchemy Table object containing left outer-joined tables.
- Return type
Table
- pdm_utils.functions.querying.query(session, db_graph, table_map, where=None)¶
Use SQLAlchemy session to retrieve ORM objects from a mapped object.
- Parameters
session (Session) – Bound and connected SQLAlchemy Session object.
table_map – SQLAlchemy ORM map object.
where (list) – MySQL WHERE clause-related SQLAlchemy object(s).
order_by (list) – MySQL ORDER BY clause-related SQLAlchemy object(s).
- Returns
List of mapped object instances.
- Return type
list
pdm_utils.functions.server module¶
Misc. functions to utilize a server.
- pdm_utils.functions.server.get_transport(host)¶
Create paramiko Transport with the server name.
- Parameters
host (str) – Server to connect to.
- Returns
Paramiko Transport object. If the server is not available, None is returned.
- Return type
Transport
- pdm_utils.functions.server.set_log_file(filepath)¶
Set the filepath used to store the Paramiko output.
This is a soft requirement for compliance with Paramiko standards. If it is not set, paramiko throws an error.
- Parameters
filepath (Path) – Path to file to log Paramiko results.
- pdm_utils.functions.server.setup_sftp_conn(transport, user=None, pwd=None, attempts=1)¶
Get credentials and setup connection to the server.
- Parameters
transport (Transport) – Paramiko Transport object directed towards a valid server.
attempts (int) – Number of attempts to connect to the server.
- Returns
Paramiko SFTPClient connection. If no connection can be made, None is returned.
- Return type
SFTPClient
- pdm_utils.functions.server.upload_file(sftp, local_filepath, remote_filepath)¶
Upload a file to the server.
- Parameters
sftp (SFTPClient) – Paramiko SFTPClient connection to a server.
local_filepath (str) – Absolute path to file to be uploaded.
remote_filepath (str) – Absolute path to server destination.
- Returns
Indicates whether upload was successful.
- Return type
bool
pdm_utils.functions.tickets module¶
Misc. functions to manipulate tickets.
- pdm_utils.functions.tickets.construct_tickets(list_of_data_dict, eval_data_dict, description_field, required_keys, optional_keys, keywords)¶
Construct pdm_utils ImportTickets from parsed data dictionaries.
- Parameters
list_of_data_dict (list) – List of import ticket data dictionaries.
eval_data_dict (dict) – Dictionary of boolean evaluation flags.
description_field (str) – Default value to set ticket.description_field attribute if not present in the data dictionary.
required_keys (set) – Set of keys required to be in the data dictionary.
optional_keys (set) – Set of optional keys that are not required to be in the data dictionary.
keywords (set) – Set of valid keyword values that are handled differently than other values.
- Returns
List of pdm_utils ImportTicket objects.
- Return type
list
- pdm_utils.functions.tickets.get_genome(tkt, gnm_type='')¶
Construct a pdm_utils Genome object from a pdm_utils ImportTicket object.
- Parameters
tkt (ImportTicket) – A pdm_utils ImportTicket object.
gnm_type (str) – Identifier for the type of genome.
- Returns
A pdm_utils Genome object.
- Return type
- pdm_utils.functions.tickets.identify_duplicates(list_of_tickets, null_set={})¶
Compare all tickets to each other to identify ticket conflicts.
Identifies whether the same id, PhageID, or Accession is present in multiple tickets.
- Parameters
list_of_tickets (list) – A list of pdm_utils ImportTicket objects.
null_set (set) – A set of values that may be expected to be duplicated, that should not throw errors.
- Returns
tuple (tkt_id_dupes, phage_id_dupes), where tkt_id_dupes (set) is a set of duplicate ticket ids and phage_id_dupes (set) is a set of duplicate PhageIDs.
- Return type
tuple
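Duplicate detection with a set of tolerated values can be sketched as below. Plain dictionaries stand in for ImportTicket objects, and the helper `find_duplicates_sketch` and the field names are hypothetical. Only the general idea (count values, ignore those in the null set) is taken from the description above.

```python
from collections import Counter


def find_duplicates_sketch(values, null_set=frozenset()):
    """Return the values seen more than once, ignoring those in null_set."""
    counts = Counter(value for value in values if value not in null_set)
    return {value for value, count in counts.items() if count > 1}


# Plain dicts stand in for ImportTicket objects in this illustration.
tickets = [
    {"id": 1, "phage_id": "Trixie"},
    {"id": 2, "phage_id": "Trixie"},
    {"id": 2, "phage_id": "none"},
    {"id": 3, "phage_id": "none"},
]
tkt_id_dupes = find_duplicates_sketch([t["id"] for t in tickets])
phage_id_dupes = find_duplicates_sketch(
    [t["phage_id"] for t in tickets], null_set={"none"}
)
```

The placeholder value "none" appears twice but is not reported, because it is listed in the null set.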
- pdm_utils.functions.tickets.modify_import_data(data_dict, required_keys, optional_keys, keywords)¶
Modifies ticket data to conform to requirements for an ImportTicket object.
- Parameters
data_dict (dict) – Dictionary of import ticket data.
required_keys (set) – Set of keys required to be in the data dictionary.
optional_keys (set) – Set of optional keys that are not required to be in the data dictionary.
keywords (set) – Set of valid keyword values that are handled differently than other values.
- Returns
Indicates if the ticket is structured properly.
- Return type
bool
- pdm_utils.functions.tickets.parse_import_ticket_data(data_dict)¶
Converts import ticket data to an ImportTicket object.
- Parameters
data_dict (dict) –
A dictionary of data with the following keys:
Import action type
Primary PhageID
Host
Cluster
Subcluster
Status
Annotation Author (int)
Feature field
Accession
Retrieve Record (int)
Evaluation mode
- Returns
A pdm_utils ImportTicket object.
- Return type
- pdm_utils.functions.tickets.set_dict_value(data_dict, key, first, second)¶
Set the value for a specific key based on ‘type’ key-value.
- Parameters
data_dict (dict) – Dictionary of import ticket data.
key (str) – Dictionary key to change value of.
first (str) – Value to assign to ‘key’ if ‘type’ == ‘add’.
second (str) – Value to assign to ‘key’ if ‘type’ != ‘add’.
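The rule described by the two parameters above can be written as a short sketch. The name `set_dict_value_sketch` is hypothetical, and the example assumes the dictionary always contains a 'type' key.

```python
def set_dict_value_sketch(data_dict, key, first, second):
    """Assign `first` when the ticket type is 'add', otherwise `second`."""
    # Assumption: data_dict always contains a 'type' key.
    if data_dict["type"] == "add":
        data_dict[key] = first
    else:
        data_dict[key] = second


add_ticket = {"type": "add", "cluster": "retrieve"}
replace_ticket = {"type": "replace", "cluster": "retrieve"}
set_dict_value_sketch(add_ticket, "cluster", "retrieve", "retain")
set_dict_value_sketch(replace_ticket, "cluster", "retrieve", "retain")
```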
- pdm_utils.functions.tickets.set_empty(data_dict)¶
Convert None values to an empty string.
- Parameters
data_dict (dict) – Dictionary of import ticket data.
- pdm_utils.functions.tickets.set_keywords(data_dict, keywords)¶
Convert specific values in a dictionary to lowercase.
- Parameters
data_dict (dict) – Dictionary of import ticket data.
keywords (set) – Set of valid keyword values that are handled differently than other values.
- pdm_utils.functions.tickets.set_missing_keys(data_dict, expected_keys)¶
Add a list of key-value pairs to a dictionary if it doesn’t have those keys.
- Parameters
data_dict (dict) – Dictionary of import ticket data.
expected_keys (set) – Set of keys expected to be in the dictionary.
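The three dictionary clean-up helpers above (set_empty, set_keywords, set_missing_keys) can be illustrated together. The function names below are hypothetical, and two details are assumptions: missing keys are filled with an empty string, and keyword matching is case-insensitive.

```python
def set_empty_sketch(data_dict):
    """Replace None values with an empty string."""
    for key, value in data_dict.items():
        if value is None:
            data_dict[key] = ""


def set_keywords_sketch(data_dict, keywords):
    """Lowercase any string value that matches a keyword."""
    for key, value in data_dict.items():
        # Assumption: matching against keywords ignores case.
        if isinstance(value, str) and value.lower() in keywords:
            data_dict[key] = value.lower()


def set_missing_keys_sketch(data_dict, expected_keys):
    """Add each expected key that is absent from the dictionary."""
    for key in expected_keys - set(data_dict):
        # Assumption: missing keys default to an empty string.
        data_dict[key] = ""


ticket = {"type": "ADD", "host": None, "cluster": "Retrieve"}
set_missing_keys_sketch(ticket, {"type", "host", "cluster", "accession"})
set_empty_sketch(ticket)
set_keywords_sketch(ticket, {"retrieve", "add"})
```

Each helper modifies the dictionary in place, so they can be chained in any order that suits the calling code.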