genome

Represents a collection of data about a genome that are commonly used to maintain and update SEA-PHAGES phage genomics data.

class pdm_utils.classes.genome.Genome

Bases: object

Class to hold data about a phage genome.

check_attribute(attribute, check_set, expect=False, eval_id=None, success='correct', fail='error', eval_def=None)

Check that the attribute value is valid.

Parameters
  • attribute (str) – Name of the Genome object attribute to evaluate.

  • check_set (set) – Set of reference ids.

  • expect (bool) – Indicates whether the attribute value is expected to be present in the check set.

  • eval_id (str) – Unique identifier for the evaluation.

  • success (str) – Default status if the outcome is a success.

  • fail (str) – Default status if the outcome is not a success.

  • eval_def (str) – Description of the evaluation.
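
The interaction between check_set and expect can be sketched as follows. This is a minimal stand-alone illustration, not the pdm_utils implementation: the helper name attribute_check_status and the SimpleNamespace stand-in for a Genome object are assumptions made for the example.

```python
from types import SimpleNamespace


def attribute_check_status(obj, attribute, check_set, expect=False,
                           success="correct", fail="error"):
    """Hypothetical helper: return `success` when the membership of the
    attribute value in check_set matches `expect`, otherwise `fail`."""
    value = getattr(obj, attribute)
    present = value in check_set
    return success if present == expect else fail


# Stand-in for a Genome object; only the attribute being checked is needed.
gnm = SimpleNamespace(id="Trixie")
existing_ids = {"L5", "D29"}

# A new genome id is expected to be absent from the ids already in use.
status_new = attribute_check_status(gnm, "id", existing_ids, expect=False)

# If the id were expected to be present, the same data gives a failure.
status_replace = attribute_check_status(gnm, "id", existing_ids, expect=True)
```

With expect=False the check succeeds only when the value is absent from the set, which suits checks for uniqueness; with expect=True it succeeds only when the value is present.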

check_authors(check_set={}, expect=True, eval_id=None, success='correct', fail='error', eval_def=None)

Check author list.

Evaluates whether at least one author in the list of authors is present in a set of reference authors.

Parameters
  • check_set (set) – Set of reference authors.

  • expect (bool) – Indicates whether at least one author in the list of authors is expected to be present in the check set.

  • eval_id – same as for check_attribute().

  • success – same as for check_attribute().

  • fail – same as for check_attribute().

  • eval_def – same as for check_attribute().
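
The "at least one author" test amounts to a non-empty set intersection. The sketch below assumes the authors are already available as a list of names and that names are compared case-insensitively; both points, and the helper name has_reference_author, are assumptions for illustration rather than documented behaviour.

```python
def has_reference_author(authors, check_set):
    """Hypothetical helper: True if any author is in the reference set."""
    # Assumption: names are normalized by trimming and lower-casing.
    found = {name.strip().lower() for name in authors}
    reference = {name.strip().lower() for name in check_set}
    return len(found & reference) > 0


authors = ["Hatfull", "Jacobs-Sera", "Smith"]
found = has_reference_author(authors, {"hatfull"})
missing = has_reference_author(authors, {"lastname", "firstname"})
```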

check_cds_end_orient_ids(eval_id=None, success='correct', fail='error', eval_def=None)

Check if there are any duplicate transcription end-orientation coordinates.

Duplicated transcription end-orientation coordinates may represent unintentional duplicate CDS features with slightly different start coordinates.

Parameters
  • eval_id – same as for check_attribute().

  • success – same as for check_attribute().

  • fail – same as for check_attribute().

  • eval_def – same as for check_attribute().
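
To illustrate why duplicated end-orientation ids are suspicious, the sketch below counts ids that occur more than once. The ids are modelled as hypothetical (transcription end, orientation) tuples; the actual id format used by the Genome class is not shown in this reference.

```python
from collections import Counter

# Hypothetical (transcription end, orientation) ids, one per CDS feature.
# The first two could come from features annotated as 100-400 and 130-400
# on the forward strand: different starts, same end and orientation.
end_orient_ids = [(400, "F"), (400, "F"), (500, "R")]

counts = Counter(end_orient_ids)
duplicated = sorted(id_ for id_, n in counts.items() if n > 1)
```

Two features sharing the id (400, "F") would be flagged even though their start-end coordinates differ, which is the case a start-end check alone would miss.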

check_cds_start_end_ids(eval_id=None, success='correct', fail='error', eval_def=None)

Check if there are any duplicate start-end coordinates.

Duplicated start-end coordinates may represent unintentional duplicate CDS features.

Parameters
  • eval_id – same as for check_attribute().

  • success – same as for check_attribute().

  • fail – same as for check_attribute().

  • eval_def – same as for check_attribute().

check_cluster_structure(eval_id=None, success='correct', fail='error', eval_def=None)

Check whether the cluster attribute is structured appropriately.

Parameters
  • eval_id – same as for check_attribute().

  • success – same as for check_attribute().

  • fail – same as for check_attribute().

  • eval_def – same as for check_attribute().

check_compatible_cluster_and_subcluster(eval_id=None, success='correct', fail='error', eval_def=None)

Check compatibility of cluster and subcluster attributes.

Parameters
  • eval_id – same as for check_attribute().

  • success – same as for check_attribute().

  • fail – same as for check_attribute().

  • eval_def – same as for check_attribute().

check_feature_coordinates(use_cds=False, use_trna=False, use_tmrna=False, other=None, strand=False, eval_id=None, success='correct', fail='error', eval_def=None)

Identify nested, duplicated, or partially-duplicated features.

Parameters
  • use_cds (bool) – Indicates whether ids for CDS features should be generated.

  • use_trna (bool) – Indicates whether ids for tRNA features should be generated.

  • use_tmrna (bool) – Indicates whether ids for tmRNA features should be generated.

  • other (list) – List of features that should be included.

  • strand (bool) – Indicates if feature orientation should be included.

  • eval_id – same as for check_attribute().

  • success – same as for check_attribute().

  • fail – same as for check_attribute().

  • eval_def – same as for check_attribute().
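
The three categories can be illustrated with a pairwise comparison of coordinates. The definitions used here (duplicated: same start and stop; partially duplicated: same start or same stop; nested: one feature strictly inside another) are assumptions for the sketch, orientation is ignored, and classify_pair is a hypothetical helper rather than part of pdm_utils.

```python
from itertools import combinations


def classify_pair(a, b):
    """Classify two (start, stop) tuples, each with start <= stop."""
    if a == b:
        return "duplicated"
    if a[0] == b[0] or a[1] == b[1]:
        return "partially duplicated"
    a_contains_b = a[0] < b[0] and b[1] < a[1]
    b_contains_a = b[0] < a[0] and a[1] < b[1]
    if a_contains_b or b_contains_a:
        return "nested"
    return None


features = [(100, 400), (100, 350), (150, 300), (500, 900)]

# Collect only the pairs that fall into one of the three categories.
problems = []
for a, b in combinations(features, 2):
    label = classify_pair(a, b)
    if label is not None:
        problems.append((a, b, label))
```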

check_magnitude(attribute, expect, ref_value, eval_id=None, success='correct', fail='error', eval_def=None)

Check that the magnitude of a numerical attribute is valid.

Parameters
  • attribute – same as for check_attribute().

  • expect (str) – Comparison symbol indicating direction of magnitude (>, =, <).

  • ref_value (int, float, datetime) – Numerical value for comparison.

  • eval_id – same as for check_attribute().

  • success – same as for check_attribute().

  • fail – same as for check_attribute().

  • eval_def – same as for check_attribute().
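
One way to picture the expect symbol is as a lookup into comparison functions. The following is a sketch under that assumption; magnitude_status and COMPARISONS are hypothetical names, and the real method operates on a named Genome attribute rather than a bare value.

```python
import operator
from datetime import datetime

# Map each comparison symbol to the corresponding operator function.
COMPARISONS = {">": operator.gt, "=": operator.eq, "<": operator.lt}


def magnitude_status(value, expect, ref_value,
                     success="correct", fail="error"):
    """Hypothetical helper: compare value to ref_value as `expect` directs."""
    compare = COMPARISONS[expect]
    return success if compare(value, ref_value) else fail


# A genome length is expected to be greater than zero.
length_status = magnitude_status(50000, ">", 0)
zero_status = magnitude_status(0, ">", 0)

# The same comparison works for datetime values.
date_status = magnitude_status(datetime(2020, 1, 1), "<", datetime(2021, 1, 1))
```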

check_nucleotides(check_set={}, eval_id=None, success='correct', fail='error', eval_def=None)

Check if all nucleotides in the sequence are expected.

Parameters
  • check_set (set) – Set of reference nucleotides.

  • eval_id – same as for check_attribute().

  • success – same as for check_attribute().

  • fail – same as for check_attribute().

  • eval_def – same as for check_attribute().
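
Checking a sequence against a set of reference nucleotides reduces to a set difference. A minimal sketch, assuming the sequence is a plain string and the comparison is case-sensitive; unexpected_nucleotides is a hypothetical helper name.

```python
def unexpected_nucleotides(sequence, check_set):
    """Return the characters in sequence that are not in check_set."""
    return set(sequence) - set(check_set)


dna = {"A", "C", "G", "T"}
clean = unexpected_nucleotides("ATGCCGTA", dna)
ambiguous = unexpected_nucleotides("ATGNNCGR", dna)
```

An empty result means every nucleotide was expected; a non-empty result lists the offending characters, here the ambiguity codes N and R.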

check_subcluster_structure(eval_id=None, success='correct', fail='error', eval_def=None)

Check whether the subcluster attribute is structured appropriately.

Parameters
  • eval_id – same as for check_attribute().

  • success – same as for check_attribute().

  • fail – same as for check_attribute().

  • eval_def – same as for check_attribute().

clear_locus_tags()

Resets locus_tags to empty string.

compare_two_attributes(attribute1, attribute2, expect_same=False, eval_id=None, success='correct', fail='error', eval_def=None)

Determine if two attributes are the same.

Parameters
  • attribute1 (str) – First attribute to compare.

  • attribute2 (str) – Second attribute to compare.

  • expect_same (bool) – Indicates whether the two attribute values are expected to be the same.

  • eval_id – same as for check_attribute().

  • success – same as for check_attribute().

  • fail – same as for check_attribute().

  • eval_def – same as for check_attribute().

parse_description()

Retrieve the name and host_genus from the ‘description’ attribute.

parse_organism()

Retrieve the name and host_genus from the ‘organism’ attribute.

parse_source()

Retrieve the name and host_genus from the ‘source’ attribute.

set_accession(value, format='empty_string')

Set the accession.

The Accession field in the MySQL database defaults to ‘’. Some flat file accessions have the version number suffix, so discard the version number.

Parameters
  • value (str) – GenBank accession number.

  • format (misc.) – Indicates the format to use if the value is not a valid accession. The default, ‘empty_string’, corresponds to ‘’.
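
Discarding the version suffix can be sketched as a split on the first period. The helper below is hypothetical and only mirrors the behaviour described above; how the method handles other empty-value formats is not covered.

```python
def strip_accession_version(value, empty=""):
    """Hypothetical helper: drop a '.N' version suffix from an accession.

    Empty or missing input is replaced by `empty`, which defaults to ''.
    """
    if not value:
        return empty
    return value.strip().split(".")[0]


versioned = strip_accession_version("MN123456.1")
plain = strip_accession_version("MN123456")
blank = strip_accession_version(None)
```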

set_annotation_author(value)

Convert annotation_author to integer value if possible.

Parameters

value (str, int) – Numeric value.

set_cds_descriptions(value)

Set each CDS processed description as indicated.

Parameters

value (str) – Name of the description field.

set_cds_features(value)

Set and tally the CDS features.

Parameters

value (list) – list of Cds objects.

set_cds_id_list()

Creates lists of CDS feature identifiers.

The first identifier is derived from the start and end coordinates. The second identifier is derived from the transcription end coordinate and orientation.
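
The two identifiers can be sketched as tuples. This assumes features carry start, stop, and orientation values, that orientation is encoded as "F" or "R", and that transcription ends at the stop coordinate on the forward strand and at the start coordinate on the reverse strand; cds_ids is a hypothetical helper, not the library's code.

```python
def cds_ids(start, stop, orientation):
    """Return (start-end id, end-orientation id) for one CDS feature."""
    start_end_id = (start, stop)
    # Assumption: forward features end at `stop`, reverse features at `start`.
    end = stop if orientation == "F" else start
    end_orient_id = (end, orientation)
    return start_end_id, end_orient_id


features = [(100, 400, "F"), (130, 400, "F"), (500, 900, "R")]
start_end_ids = [cds_ids(*f)[0] for f in features]
end_orient_ids = [cds_ids(*f)[1] for f in features]
```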

set_cluster(value)

Set the cluster and modify singleton if needed.

Parameters

value (str) – Cluster designation of the genome.

set_date(value, format='empty_datetime_obj')

Set the date attribute.

Parameters
  • value (misc) – Date

  • format (str) – Indicates the format if the value is empty.

set_eval(eval_id, definition, result, status)

Constructs and adds an Evaluation object to the evaluations list.

Parameters
  • eval_id (str) – Unique identifier for the evaluation.

  • definition (str) – Description of the evaluation.

  • result (str) – Description of the outcome of the evaluation.

  • status (str) – Outcome of the evaluation.
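
The pattern described here, building a record from four fields and appending it to a list, can be sketched with dataclasses. EvaluationRecord and GenomeSketch are hypothetical stand-ins for the Evaluation and Genome classes, which are not reproduced in this reference.

```python
from dataclasses import dataclass, field


@dataclass
class EvaluationRecord:
    """Hypothetical stand-in for an Evaluation object."""
    id: str
    definition: str
    result: str
    status: str


@dataclass
class GenomeSketch:
    """Hypothetical stand-in holding only the evaluations list."""
    evaluations: list = field(default_factory=list)

    def set_eval(self, eval_id, definition, result, status):
        # Construct the record and add it to the evaluations list.
        record = EvaluationRecord(eval_id, definition, result, status)
        self.evaluations.append(record)


gnm = GenomeSketch()
gnm.set_eval("GNM_001", "Check the genome id.", "The id is valid.", "correct")
```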

set_feature_genome_ids(use_cds=False, use_trna=False, use_tmrna=False, use_source=False, value=None)

Sets the genome_id of each feature.

Parameters
  • use_cds (bool) – Indicates whether genome_id for CDS features should be set.

  • use_trna (bool) – Indicates whether genome_id for tRNA features should be set.

  • use_tmrna (bool) – Indicates whether genome_id for tmRNA features should be set.

  • use_source (bool) – Indicates whether genome_id for source features should be set.

  • value (str) – Genome identifier.

set_feature_ids(use_type=False, use_cds=False, use_trna=False, use_tmrna=False, use_source=False)

Sets the id of each feature.

Lists of features can be added to this method. The method assumes that all elements in all lists contain ‘id’, ‘start’, and ‘stop’ attributes. This feature attribute is processed within the Genome object, and not within the feature itself, because the method sorts all features and generates systematic IDs based on feature order in the genome.

Parameters
  • use_type (bool) – Indicates whether the type of object should be added to the feature id.

  • use_cds (bool) – Indicates whether ids for CDS features should be generated.

  • use_trna (bool) – Indicates whether ids for tRNA features should be generated.

  • use_tmrna (bool) – Indicates whether ids for tmRNA features should be generated.

  • use_source (bool) – Indicates whether ids for source features should be generated.
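
Because the ids depend on feature order, they have to be assigned after all features are sorted together. The sketch below shows that idea only: the id pattern (genome id, optional type, running index), the sort key, and the helper name assign_feature_ids are assumptions, not the format pdm_utils produces.

```python
from types import SimpleNamespace


def assign_feature_ids(genome_id, features, use_type=False):
    """Hypothetical helper: sort features and number them in genome order."""
    # Assumption: features are ordered by start, then stop coordinate.
    ordered = sorted(features, key=lambda f: (f.start, f.stop))
    for index, feature in enumerate(ordered, start=1):
        if use_type:
            feature.id = f"{genome_id}_{feature.type}_{index}"
        else:
            feature.id = f"{genome_id}_{index}"
    return ordered


# Stand-ins for feature objects with 'id', 'start', and 'stop' attributes.
cds1 = SimpleNamespace(id="", start=900, stop=1500, type="CDS")
trna = SimpleNamespace(id="", start=300, stop=375, type="TRNA")
cds2 = SimpleNamespace(id="", start=10, stop=250, type="CDS")

ordered = assign_feature_ids("Trixie", [cds1, trna, cds2], use_type=True)
ids = [f.id for f in ordered]
```

A single feature cannot compute its own index without knowing about the others, which is why the numbering is done at the genome level.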

set_filename(filepath)

Set the filename. Discard the path and file extension.

Parameters

filepath (Path) – name of the file reference.
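
Discarding the path and file extension corresponds to taking the stem of a pathlib path. A minimal sketch, with a made-up file path and a hypothetical helper name:

```python
from pathlib import Path


def filename_from_path(filepath):
    """Return the file name without its directory or final extension."""
    return Path(filepath).stem


name = filename_from_path("/data/genomes/Trixie.gb")
```

Note that Path.stem removes only the final suffix, so a name such as Trixie.v2.gb would become Trixie.v2.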

set_host_genus(value=None, attribute=None, format='empty_string')

Set the host_genus from a value parsed from the indicated attribute.

The input data is split into multiple parts, and the first word is used to set host_genus.

Parameters
  • value (str) – the host genus of the phage genome

  • attribute (str) – the name of the genome attribute from which the host_genus attribute will be set

  • format (str) – the default format if the input is an empty/null value.
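
Taking the first word of a multi-part string can be sketched as below. The helper first_word is hypothetical, splitting on whitespace is an assumption, and the example string is illustrative only.

```python
def first_word(text, empty=""):
    """Hypothetical helper: return the first whitespace-separated word.

    Empty or missing input is replaced by `empty`.
    """
    parts = text.split() if text else []
    return parts[0] if parts else empty


genus = first_word("Mycobacterium smegmatis mc2 155")
no_genus = first_word("")
```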

set_id(value=None, attribute=None)

Set the id from either an input value or an indicated attribute.

Parameters
  • value (str) – unique identifier for the genome.

  • attribute (str) – name of a genome object attribute that stores a unique identifier for the genome.

set_retrieve_record(value)

Convert retrieve_record to integer value if possible.

Parameters

value (str, int) – Numeric value.

set_sequence(value)

Set the nucleotide sequence and compute the length.

This method coerces sequences into a Biopython Seq object.

Parameters

value (str or Seq) – the genome’s nucleotide sequence.

set_source_features(value)

Set and tally the source features.

Parameters

value (list) – list of Source objects.

set_subcluster(value)

Set the subcluster.

Parameters

value (str) – Subcluster designation of the genome.

set_tmrna_features(value)

Set and tally the tmRNA features.

Parameters

value (list) – list of Tmrna objects.

set_trna_features(value)

Set and tally the tRNA features.

Parameters

value (list) – list of Trna objects.

set_unique_cds_end_orient_ids()

Identify CDS features that contain unique transcription end-orientation coordinates.

set_unique_cds_start_end_ids()

Identify CDS features that contain unique start-end coordinates.

tally_cds_descriptions()

Tally the non-generic CDS descriptions.

update_name_and_id(value)

Update the genome name and id in all locations in a Genome object.

Parameters
  • value (str) – Value used to update the Genome id and name.