genome¶

Represents a collection of data about a genome that are commonly used to maintain and update SEA-PHAGES phage genomics data.

class pdm_utils.classes.genome.Genome¶

Bases: object

Class to hold data about a phage genome.

check_attribute(attribute, check_set, expect=False, eval_id=None, success='correct', fail='error', eval_def=None)¶

Check that the attribute value is valid.

Parameters

attribute (str) – Name of the Genome object attribute to evaluate.
check_set (set) – Set of reference ids.
expect (bool) – Indicates whether the attribute value is expected to be present in the check set.
eval_id (str) – Unique identifier for the evaluation.
success (str) – Default status if the outcome is a success.
fail (str) – Default status if the outcome is not a success.
eval_def (str) – Description of the evaluation.
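
The interplay between check_set and expect can be illustrated with a small stand-alone sketch. It is not the pdm_utils implementation: the helper name evaluate_membership is hypothetical, and it assumes the outcome counts as a success exactly when the value's membership in check_set matches expect.

```python
def evaluate_membership(value, check_set, expect=False,
                        success="correct", fail="error"):
    """Return `success` when membership of `value` in `check_set` matches `expect`.

    Hypothetical helper for illustration only; the real method also records
    an evaluation on the Genome object.
    """
    present = value in check_set
    return success if present == expect else fail


# A new genome id is expected NOT to be in the set of known ids (expect=False).
known_ids = {"Trixie", "L5"}
print(evaluate_membership("D29", known_ids, expect=False))     # correct
print(evaluate_membership("Trixie", known_ids, expect=False))  # error
```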

check_authors(check_set={}, expect=True, eval_id=None, success='correct', fail='error', eval_def=None)¶

Check author list.

Evaluates whether at least one author in the list of authors is present in a set of reference authors.

Parameters

check_set (set) – Set of reference authors.
expect (bool) – Indicates whether at least one author in the list of authors is expected to be present in the check set.
eval_id – same as for check_attribute().
success – same as for check_attribute().
fail – same as for check_attribute().
eval_def – same as for check_attribute().

check_cds_end_orient_ids(eval_id=None, success='correct', fail='error', eval_def=None)¶

Check if there are any duplicate transcription end-orientation coordinates.

Duplicated transcription end-orientation coordinates may represent unintentional duplicate CDS features with slightly different start coordinates.

Parameters

eval_id – same as for check_attribute().
success – same as for check_attribute().
fail – same as for check_attribute().
eval_def – same as for check_attribute().
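
Detecting duplicated identifiers amounts to counting occurrences. The sketch below is a minimal illustration under stated assumptions: the helper find_duplicate_ids is hypothetical, and identifiers are modelled as plain (end, orientation) tuples rather than as data taken from Cds objects.

```python
from collections import Counter


def find_duplicate_ids(ids):
    """Return a sorted list of the identifiers that occur more than once."""
    counts = Counter(ids)
    return sorted(i for i, n in counts.items() if n > 1)


# Hypothetical (transcription end, orientation) pairs for three CDS features.
end_orient_ids = [(500, "F"), (500, "F"), (900, "R")]
duplicates = find_duplicate_ids(end_orient_ids)
print(duplicates)  # [(500, 'F')]
```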

check_cds_start_end_ids(eval_id=None, success='correct', fail='error', eval_def=None)¶

Check if there are any duplicate start-end coordinates.

Duplicated start-end coordinates may represent unintentional duplicate CDS features.

Parameters

eval_id – same as for check_attribute().
success – same as for check_attribute().
fail – same as for check_attribute().
eval_def – same as for check_attribute().

check_cluster_structure(eval_id=None, success='correct', fail='error', eval_def=None)¶

Check whether the cluster attribute is structured appropriately.

Parameters

eval_id – same as for check_attribute().
success – same as for check_attribute().
fail – same as for check_attribute().
eval_def – same as for check_attribute().

check_compatible_cluster_and_subcluster(eval_id=None, success='correct', fail='error', eval_def=None)¶

Check compatibility of cluster and subcluster attributes.

Parameters

eval_id – same as for check_attribute().
success – same as for check_attribute().
fail – same as for check_attribute().
eval_def – same as for check_attribute().
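
One plausible reading of "compatible" follows the SEA-PHAGES naming convention in which a subcluster is its cluster name followed by a number (for example, subcluster A2 belongs to cluster A). The sketch below encodes only that convention. The helper name and the regular expression are assumptions, and the real method may handle further cases (such as singletons or empty values) that are not shown here.

```python
import re


def cluster_matches_subcluster(cluster, subcluster):
    """Return True if `subcluster` is `cluster` followed by one or more digits.

    Hypothetical helper; assumes subclusters look like 'A2' or 'AB10'.
    """
    match = re.fullmatch(r"([A-Za-z]+)(\d+)", subcluster)
    if match is None:
        return False
    return match.group(1) == cluster


print(cluster_matches_subcluster("A", "A2"))  # True
print(cluster_matches_subcluster("A", "B3"))  # False
```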

check_feature_coordinates(use_cds=False, use_trna=False, use_tmrna=False, other=None, strand=False, eval_id=None, success='correct', fail='error', eval_def=None)¶

Identify nested, duplicated, or partially-duplicated features.

Parameters

use_cds (bool) – Indicates whether ids for CDS features should be generated.
use_trna (bool) – Indicates whether ids for tRNA features should be generated.
use_tmrna (bool) – Indicates whether ids for tmRNA features should be generated.
other (list) – List of features that should be included.
strand (bool) – Indicates if feature orientation should be included.
eval_id – same as for check_attribute().
success – same as for check_attribute().
fail – same as for check_attribute().
eval_def – same as for check_attribute().

check_magnitude(attribute, expect, ref_value, eval_id=None, success='correct', fail='error', eval_def=None)¶

Check that the magnitude of a numerical attribute is valid.

Parameters

attribute – same as for check_attribute().
expect (str) – Comparison symbol indicating direction of magnitude (>, =, <).
ref_value (int, float, datetime) – Numerical value for comparison.
eval_id – same as for check_attribute().
success – same as for check_attribute().
fail – same as for check_attribute().
eval_def – same as for check_attribute().
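
The comparison symbols map naturally onto Python's comparison operators. The following is a sketch rather than the library's code: the helper magnitude_matches is hypothetical, and it assumes the attribute value sits on the left-hand side of the comparison (value > ref_value for '>').

```python
import operator
from datetime import datetime

# Map each accepted comparison symbol to the matching operator.
COMPARATORS = {">": operator.gt, "=": operator.eq, "<": operator.lt}


def magnitude_matches(value, expect, ref_value):
    """Return True if `value <expect> ref_value` holds."""
    return COMPARATORS[expect](value, ref_value)


print(magnitude_matches(51370, ">", 0))                                # True
print(magnitude_matches(datetime(2020, 1, 1), "<", datetime(1, 1, 1)))  # False
```

Because the operators are generic, the same helper works for int, float, and datetime values, as long as value and ref_value are of comparable types.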

check_nucleotides(check_set={}, eval_id=None, success='correct', fail='error', eval_def=None)¶

Check if all nucleotides in the sequence are expected.

Parameters

check_set (set) – Set of reference nucleotides.
eval_id – same as for check_attribute().
success – same as for check_attribute().
fail – same as for check_attribute().
eval_def – same as for check_attribute().
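
Conceptually, this check reduces to a set difference between the characters in the sequence and the reference set. A minimal sketch, assuming a plain string sequence and no case normalization (the helper name is hypothetical):

```python
def unexpected_nucleotides(sequence, check_set):
    """Return the set of characters in `sequence` that are not in `check_set`."""
    return set(sequence) - set(check_set)


dna = {"A", "T", "C", "G"}
print(unexpected_nucleotides("ATCGATCG", dna))  # set()
print(unexpected_nucleotides("ATCGNN", dna))    # {'N'}
```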

check_subcluster_structure(eval_id=None, success='correct', fail='error', eval_def=None)¶

Check whether the subcluster attribute is structured appropriately.

Parameters

eval_id – same as for check_attribute().
success – same as for check_attribute().
fail – same as for check_attribute().
eval_def – same as for check_attribute().

clear_locus_tags()¶: Resets locus_tags to empty string.

compare_two_attributes(attribute1, attribute2, expect_same=False, eval_id=None, success='correct', fail='error', eval_def=None)¶

Determine if two attributes are the same.

Parameters

attribute1 (str) – First attribute to compare.
attribute2 (str) – Second attribute to compare.
expect_same (bool) – Indicates whether the two attribute values are expected to be the same.
eval_id – same as for check_attribute().
success – same as for check_attribute().
fail – same as for check_attribute().
eval_def – same as for check_attribute().
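
Since the attributes are passed by name, the comparison can be pictured as two getattr lookups. This is an illustrative sketch only: attributes_agree is a hypothetical helper, and a SimpleNamespace stands in for a Genome object.

```python
from types import SimpleNamespace


def attributes_agree(obj, attribute1, attribute2, expect_same=False):
    """Return True when the equality of the two attributes matches `expect_same`."""
    same = getattr(obj, attribute1) == getattr(obj, attribute2)
    return same == expect_same


# Stand-in object with two attributes that should hold the same value.
gnm = SimpleNamespace(id="Trixie", name="Trixie", host_genus="Mycobacterium")
print(attributes_agree(gnm, "id", "name", expect_same=True))        # True
print(attributes_agree(gnm, "id", "host_genus", expect_same=True))  # False
```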

parse_description()¶: Retrieve the name and host_genus from the ‘description’ attribute.

parse_organism()¶: Retrieve the name and host_genus from the ‘organism’ attribute.

parse_source()¶: Retrieve the name and host_genus from the ‘source’ attribute.

set_accession(value, format='empty_string')¶

Set the accession.

The Accession field in the MySQL database defaults to ‘’. Some flat file accessions have the version number suffix, so discard the version number.

Parameters

value (str) – GenBank accession number.
format (misc.) – indicates the format of the data if it is not a valid accession. Default is ‘’.
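
Discarding the version suffix can be sketched with a simple split on the period. The helper below is hypothetical and simplifies the fallback: where the real method selects an empty value through format, the sketch takes a literal default.

```python
def strip_accession_version(value, default=""):
    """Return the accession without its '.N' version suffix.

    Falls back to `default` when `value` is empty or None.
    """
    if not value:
        return default
    return value.strip().split(".")[0]


print(strip_accession_version("NC_001900.1"))  # NC_001900
print(strip_accession_version("JN699004"))     # JN699004
```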

set_annotation_author(value)¶

Convert annotation_author to integer value if possible.

Parameters: value (str, int) – Numeric value.

set_cds_descriptions(value)¶

Set each CDS processed description as indicated.

Parameters: value (str) – Name of the description field.

set_cds_features(value)¶

Set and tally the CDS features.

Parameters: value (list) – list of Cds objects.

set_cds_id_list()¶

Creates lists of CDS feature identifiers.

The first identifier is derived from the start and end coordinates. The second identifier is derived from the transcription end coordinate and orientation.
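
The difference between the two identifiers is easiest to see with two features that share a stop codon but differ in their start. The sketch below uses a hypothetical CdsStub in place of a Cds object, and it assumes that the transcription end is the stop coordinate for forward features and the start coordinate for reverse features.

```python
from typing import NamedTuple


class CdsStub(NamedTuple):
    """Hypothetical stand-in for a Cds object."""
    start: int
    stop: int
    orientation: str  # "F" (forward) or "R" (reverse)


def start_end_id(cds):
    """Identifier built from the start and end coordinates."""
    return (cds.start, cds.stop)


def end_orient_id(cds):
    """Identifier built from the transcription end coordinate and orientation."""
    end = cds.stop if cds.orientation == "F" else cds.start
    return (end, cds.orientation)


a = CdsStub(start=100, stop=500, orientation="F")
b = CdsStub(start=130, stop=500, orientation="F")  # same end, different start
print(start_end_id(a) == start_end_id(b))    # False
print(end_orient_id(a) == end_orient_id(b))  # True
```

Under these assumptions, the second identifier flags the pair as a likely duplicate even though the first identifier does not.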

set_cluster(value)¶

Set the cluster and modify singleton if needed.

Parameters: value (str) – Cluster designation of the genome.

set_date(value, format='empty_datetime_obj')¶

Set the date attribute.

Parameters

value (misc) – Date
format (str) – Indicates the format if the value is empty.

set_eval(eval_id, definition, result, status)¶

Constructs and adds an Evaluation object to the evaluations list.

Parameters

eval_id (str) – Unique identifier for the evaluation.
definition (str) – Description of the evaluation.
result (str) – Description of the outcome of the evaluation.
status (str) – Outcome of the evaluation.
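
The bookkeeping described here can be sketched with two small dataclasses. Both EvaluationStub and GenomeStub are hypothetical stand-ins, and their field names are assumptions that simply mirror the parameter names above.

```python
from dataclasses import dataclass, field


@dataclass
class EvaluationStub:
    """Hypothetical record of one evaluation."""
    id: str
    definition: str
    result: str
    status: str


@dataclass
class GenomeStub:
    """Hypothetical stand-in that only keeps a list of evaluations."""
    evaluations: list = field(default_factory=list)

    def set_eval(self, eval_id, definition, result, status):
        # Build the record and append it, so evaluations accumulate in order.
        self.evaluations.append(
            EvaluationStub(eval_id, definition, result, status))


gnm = GenomeStub()
gnm.set_eval("GNM_001", "Check the id.", "The id is valid.", "correct")
print(len(gnm.evaluations), gnm.evaluations[0].status)  # 1 correct
```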

set_feature_genome_ids(use_cds=False, use_trna=False, use_tmrna=False, use_source=False, value=None)¶

Sets the genome_id of each feature.

Parameters

use_cds (bool) – Indicates whether genome_id for CDS features should be set.
use_trna (bool) – Indicates whether genome_id for tRNA features should be set.
use_tmrna (bool) – Indicates whether genome_id for tmRNA features should be set.
use_source (bool) – Indicates whether genome_id for source features should be set.
value (str) – Genome identifier.

set_feature_ids(use_type=False, use_cds=False, use_trna=False, use_tmrna=False, use_source=False)¶

Sets the id of each feature.

Lists of features can be added to this method. The method assumes that all elements in all lists contain ‘id’, ‘start’, and ‘stop’ attributes. The feature id is set within the Genome object, and not within the feature itself, because the method sorts all features and generates systematic IDs based on feature order in the genome.

Parameters

use_type (bool) – Indicates whether the type of object should be added to the feature id.
use_cds (bool) – Indicates whether ids for CDS features should be generated.
use_trna (bool) – Indicates whether ids for tRNA features should be generated.
use_tmrna (bool) – Indicates whether ids for tmRNA features should be generated.
use_source (bool) – Indicates whether ids for source features should be generated.
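
Generating ids from genome order can be sketched as a sort followed by enumeration. Everything below is illustrative: FeatureStub is a hypothetical stand-in, the sort key (start, then stop) is an assumption, and so is the id format, which here joins the genome id, an optional type label, and the position.

```python
from dataclasses import dataclass


@dataclass
class FeatureStub:
    """Hypothetical feature with the attributes the method relies on."""
    start: int
    stop: int
    type: str
    id: str = ""


def assign_ids(genome_id, features, use_type=False):
    """Sort features by position and assign ids based on their rank."""
    ordered = sorted(features, key=lambda f: (f.start, f.stop))
    for index, feature in enumerate(ordered, start=1):
        if use_type:
            feature.id = f"{genome_id}_{feature.type}_{index}"
        else:
            feature.id = f"{genome_id}_{index}"
    return ordered


features = [
    FeatureStub(900, 1200, "CDS"),
    FeatureStub(100, 400, "CDS"),
    FeatureStub(450, 520, "TRNA"),
]
for f in assign_ids("Trixie", features, use_type=True):
    print(f.id)
# Trixie_CDS_1
# Trixie_TRNA_2
# Trixie_CDS_3
```

Assigning ids at the genome level, as in this sketch, is what lets the numbering reflect the combined order of all feature types.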

set_filename(filepath)¶

Set the filename. Discard the path and file extension.

Parameters: filepath (Path) – name of the file reference.
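
Dropping the directory and the extension corresponds to the stem of a pathlib.Path. A minimal sketch, in which the helper name and the example path are hypothetical:

```python
from pathlib import Path


def filename_stem(filepath):
    """Return the file name without its directory or final extension."""
    return Path(filepath).stem


print(filename_stem("genomes/Trixie.gb"))  # Trixie
```

Note that Path.stem removes only the final suffix, so a name such as Trixie.tar.gz would yield Trixie.tar in this sketch.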

set_host_genus(value=None, attribute=None, format='empty_string')¶

Set the host_genus from a value parsed from the indicated attribute.

The input data is split into multiple parts, and the first word is used to set host_genus.

Parameters

value (str) – the host genus of the phage genome
attribute (str) – the name of the genome attribute from which the host_genus attribute will be set
format (str) – the default format if the input is an empty/null value.
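
Taking the first word of the input can be sketched with str.split. The helper is hypothetical, and the fallback is simplified: the sketch takes a literal default where the real method derives the empty value from format.

```python
def first_word(value, default=""):
    """Return the first whitespace-delimited word of `value`, or `default`."""
    if not value or not value.strip():
        return default
    return value.split()[0]


print(first_word("Mycobacterium smegmatis mc2 155"))  # Mycobacterium
```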

set_id(value=None, attribute=None)¶

Set the id from either an input value or an indicated attribute.

Parameters

value (str) – unique identifier for the genome.
attribute (str) – name of a genome object attribute that stores a unique identifier for the genome.

set_retrieve_record(value)¶

Convert retrieve_record to integer value if possible.

Parameters: value (str, int) – Numeric value.

set_sequence(value)¶

Set the nucleotide sequence and compute the length.

This method coerces sequences into a Biopython Seq object.

Parameters: value (str or Seq) – the genome’s nucleotide sequence.

set_source_features(value)¶

Set and tally the source features.

Parameters: value (list) – list of Source objects.

set_subcluster(value)¶

Set the subcluster.

Parameters: value (str) – Subcluster designation of the genome.

set_tmrna_features(value)¶

Set and tally the tmRNA features.

Parameters: value (list) – list of Tmrna objects.

set_trna_features(value)¶

Set and tally the tRNA features.

Parameters: value (list) – list of Trna objects.

set_unique_cds_end_orient_ids()¶: Identify CDS features that contain unique transcription end-orientation coordinates.

set_unique_cds_start_end_ids()¶: Identify CDS features that contain unique start-end coordinates.

tally_cds_descriptions()¶: Tally the non-generic CDS descriptions.

update_name_and_id(value)¶

Update the genome name and id in all locations in a Genome object.

Parameters: value (str) – Value used to update the Genome id and name.