cds

Represents a collection of data about a CDS feature that is commonly used to maintain and update SEA-PHAGES phage genomics data.

class pdm_utils.classes.cds.Cds

Bases: object

Class to hold data about a CDS feature.

check_amino_acids(check_set={}, eval_id=None, success='correct', fail='error', eval_def=None)

Check whether all amino acids in the translation are valid.

Parameters
  • check_set (set) – Set of valid amino acids.

  • eval_id – same as for check_attribute().

  • success – same as for check_attribute().

  • fail – same as for check_attribute().

  • eval_def – same as for check_attribute().
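As an illustration of this check, the following sketch reports the residues of a translation that fall outside a set of valid amino acids. It is a standalone example, not the pdm_utils implementation: the helper name invalid_amino_acids and the STANDARD_AMINO_ACIDS constant are hypothetical, and the 20-letter alphabet is an assumption about what a check set might contain.

```python
# Assumed check set: the 20 standard one-letter amino acid codes.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def invalid_amino_acids(translation, check_set):
    """Return the residues in `translation` that are absent from `check_set`.

    Hypothetical helper for illustration only.
    """
    return set(translation) - set(check_set)


# 'X' and 'B' are not in the assumed alphabet, so they are reported.
unexpected = invalid_amino_acids("MKTAYXB", STANDARD_AMINO_ACIDS)
```

An empty result means every residue is in the check set, which corresponds to a successful evaluation.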

check_attribute(attribute, check_set, expect=False, eval_id=None, success='correct', fail='error', eval_def=None)

Check that the attribute value is valid.

Parameters
  • attribute (str) – Name of the CDS object attribute to evaluate.

  • check_set (set) – Set of reference ids.

  • expect (bool) – Indicates whether the attribute value is expected to be present in the check set.

  • eval_id (str) – Unique identifier for the evaluation.

  • success (str) – Default status if the outcome is a success.

  • fail (str) – Default status if the outcome is not a success.

  • eval_def (str) – Description of the evaluation.
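The interaction between check_set and expect can be sketched in plain Python. This is a minimal illustration under stated assumptions: the function name evaluate_membership and its return value are hypothetical, and the real method presumably records its outcome as an evaluation on the object instead of returning a status.

```python
def evaluate_membership(value, check_set, expect=False,
                        success="correct", fail="error"):
    """Return `success` when membership in `check_set` matches `expect`.

    Hypothetical helper for illustration only.
    """
    present = value in check_set
    return success if present == expect else fail


# The value is expected to be in the set, and it is.
status_ok = evaluate_membership("F", {"F", "R"}, expect=True)

# The value is not expected to be in the set, but it is.
status_bad = evaluate_membership("", {""}, expect=False)
```

With expect=False the check set acts as a list of disallowed values; with expect=True it acts as a list of allowed values.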

check_compatible_gene_and_locus_tag(eval_id=None, success='correct', fail='error', eval_def=None)

Check if gene and locus_tag attributes contain identical numbers.

Parameters
  • eval_id – same as for check_attribute().

  • success – same as for check_attribute().

  • fail – same as for check_attribute().

  • eval_def – same as for check_attribute().

check_description_field(attribute='product', eval_id=None, success='correct', fail='error', eval_def=None)

Check if there are CDS descriptions in unexpected fields.

Evaluates whether the indicated attribute is empty or generic, and other fields contain non-generic data.

Parameters
  • attribute (str) – Indicates the reference attribute for the evaluation (‘product’, ‘function’, ‘note’).

  • eval_id – same as for check_attribute().

  • success – same as for check_attribute().

  • fail – same as for check_attribute().

  • eval_def – same as for check_attribute().

check_gene_structure(eval_id=None, success='correct', fail='error', eval_def=None)

Check if the gene qualifier contains an integer.

Parameters
  • eval_id – same as for check_attribute().

  • success – same as for check_attribute().

  • fail – same as for check_attribute().

  • eval_def – same as for check_attribute().

check_generic_data(attribute=None, eval_id=None, success='correct', fail='error', eval_def=None)

Check if the indicated attribute contains generic data.

Parameters
  • attribute (str) – Indicates the attribute for the evaluation (‘product’, ‘function’, ‘note’).

  • eval_id – same as for check_attribute().

  • success – same as for check_attribute().

  • fail – same as for check_attribute().

  • eval_def – same as for check_attribute().

check_locus_tag_structure(check_value=None, only_typo=False, prefix_set={}, case=True, eval_id=None, success='correct', fail='error', eval_def=None)

Check if the locus_tag is structured correctly.

Parameters
  • check_value (str) – Indicates the genome id that is expected to be present. If None, the ‘genome_id’ attribute is used.

  • only_typo (bool) – Indicates if only the genome id spelling should be evaluated.

  • prefix_set (set) – Indicates valid common prefixes, if a prefix is expected.

  • case (bool) – Indicates whether the locus_tag is expected to be capitalized.

  • eval_id – same as for check_attribute().

  • success – same as for check_attribute().

  • fail – same as for check_attribute().

  • eval_def – same as for check_attribute().

check_magnitude(attribute, expect, ref_value, eval_id=None, success='correct', fail='error', eval_def=None)

Check that the magnitude of a numerical attribute is valid.

Parameters
  • attribute – same as for check_attribute().

  • expect (str) – Comparison symbol indicating direction of magnitude (>, =, <).

  • ref_value (int, float, datetime) – Numerical value for comparison.

  • eval_id – same as for check_attribute().

  • success – same as for check_attribute().

  • fail – same as for check_attribute().

  • eval_def – same as for check_attribute().
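One way to interpret the comparison symbol passed as expect is to map it to a comparison function. The sketch below assumes exactly the three symbols listed above; the names COMPARISONS and magnitude_matches are hypothetical and not part of pdm_utils.

```python
import operator

# Map the comparison symbols accepted by 'expect' to comparison functions.
COMPARISONS = {">": operator.gt, "=": operator.eq, "<": operator.lt}


def magnitude_matches(value, expect, ref_value):
    """Return True if `value` relates to `ref_value` as `expect` indicates.

    Hypothetical helper for illustration only.
    """
    if expect not in COMPARISONS:
        raise ValueError(f"Unsupported comparison symbol: {expect!r}")
    return COMPARISONS[expect](value, ref_value)


# Example: a translation length of 120 is expected to be greater than 0.
length_is_positive = magnitude_matches(120, ">", 0)
```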

check_orientation(format='fr_short', case=True, eval_id=None, success='correct', fail='error', eval_def=None)

Check if orientation is set appropriately.

Relies on the reformat_strand function to manage orientation data.

Parameters
  • format (str) – Indicates how the orientation data should be formatted.

  • case (bool) – Indicates whether the orientation data should be cased.

  • eval_id – same as for check_attribute().

  • success – same as for check_attribute().

  • fail – same as for check_attribute().

  • eval_def – same as for check_attribute().

check_translation(eval_id=None, success='correct', fail='error', eval_def=None)

Check that the current and expected translations match.

Parameters
  • eval_id – same as for check_attribute().

  • success – same as for check_attribute().

  • fail – same as for check_attribute().

  • eval_def – same as for check_attribute().

create_seqfeature(type, start, stop, strand)

get_begin_end()

Get feature coordinates in transcription begin-end format.

Returns

(Begin, End) Start and stop coordinates ordered by which coordinate indicates the transcriptional beginning and end of the feature.

Return type

tuple
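The begin-end ordering can be sketched as follows. This is an illustration only: the function name begin_end is hypothetical, the orientation is assumed to be encoded as 'F' or 'R', and raising an error for other values is a choice made for this sketch, not documented behavior.

```python
def begin_end(start, stop, orientation):
    """Order coordinates by the direction of transcription.

    Assumes orientation is 'F' (forward) or 'R' (reverse).
    """
    if orientation == "F":
        # Forward features are transcribed from start to stop.
        return (start, stop)
    if orientation == "R":
        # Reverse features are transcribed from stop back to start.
        return (stop, start)
    raise ValueError(f"Unrecognized orientation: {orientation!r}")


forward_coords = begin_end(100, 400, "F")
reverse_coords = begin_end(100, 400, "R")
```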

get_qualifiers(type)

Helper function that uses CDS data to populate the qualifiers attribute of a SeqFeature.

Returns

A dictionary formatted like the qualifiers attribute of a Biopython SeqFeature.

Return type

dict

reformat_start_and_stop(new_format)

Convert start and stop coordinates to new coordinate format. This also updates the coordinate format attribute to reflect change.

Relies on the reformat_coordinates function.

Parameters

new_format (str) – Indicates how coordinates should be formatted.
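A coordinate conversion of this kind can be sketched for the two most common conventions. The format names '0_half_open' and '1_closed' are assumptions made for this example, as is the function name convert_coordinates; consult the reformat_coordinates function for the formats actually supported.

```python
def convert_coordinates(start, stop, current_format, new_format):
    """Convert (start, stop) between two assumed coordinate formats.

    '0_half_open': 0-based start, exclusive stop (Python-style slicing).
    '1_closed':    1-based start, inclusive stop (GenBank-style display).
    """
    if current_format == new_format:
        return (start, stop)
    if (current_format, new_format) == ("0_half_open", "1_closed"):
        # Only the start shifts; the exclusive stop equals the inclusive stop.
        return (start + 1, stop)
    if (current_format, new_format) == ("1_closed", "0_half_open"):
        return (start - 1, stop)
    raise ValueError(f"Cannot convert {current_format!r} to {new_format!r}")


# The first 300 bases of a genome in both conventions.
closed = convert_coordinates(0, 300, "0_half_open", "1_closed")
```

In both conventions the feature length is the same; only the way the left boundary is counted changes.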

set_description(value)

Set the primary raw and processed description attributes.

Parameters

value (str) – Indicates which reference attributes are used to set the attributes (‘product’, ‘function’, ‘note’).

set_description_field(attr, description, delimiter=None, prefix_set=None)

Set a description attribute parsed from a description.

Parameters
  • attr (str) – Attribute to set the description.

  • description (str) – Description data to parse. Also passed to set_num().

  • delimiter (str) – Passed to set_num().

  • prefix_set (set) – Passed to set_num().

set_eval(eval_id, definition, result, status)

Constructs and adds an Evaluation object to the evaluations list.

Parameters
  • eval_id (str) – Unique identifier for the evaluation.

  • definition (str) – Description of the evaluation.

  • result (str) – Description of the outcome of the evaluation.

  • status (str) – Outcome of the evaluation.
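The pattern of building an evaluation record and appending it to a list can be sketched with dataclasses. Both EvalRecord and FeatureSketch are hypothetical stand-ins written for this example; they do not reproduce the structure of the actual Evaluation or Cds classes.

```python
from dataclasses import dataclass, field


@dataclass
class EvalRecord:
    """Hypothetical stand-in for an evaluation record."""
    eval_id: str
    definition: str
    result: str
    status: str


@dataclass
class FeatureSketch:
    """Hypothetical stand-in for an object that collects evaluations."""
    evaluations: list = field(default_factory=list)

    def set_eval(self, eval_id, definition, result, status):
        # Build the record and append it to the running list.
        record = EvalRecord(eval_id, definition, result, status)
        self.evaluations.append(record)


feature = FeatureSketch()
feature.set_eval("CDS_001", "Check the orientation.",
                 "The orientation is valid.", "correct")
```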

set_gene(value, delimiter=None, prefix_set=None)

Set the gene attribute.

Parameters
  • value (str) – Gene data to parse. Also passed to set_num().

  • delimiter (str) – Passed to set_num().

  • prefix_set (set) – Passed to set_num().

set_location_id()

Create a tuple of feature location data.

For start and stop coordinates of the feature, it doesn’t matter whether the feature is complex with a translational frameshift or not. Retrieving the “start” and “stop” boundary attributes returns the very beginning and end of the feature, disregarding the inner “join” coordinates. If only the feature transcription “end” coordinate is used, orientation information is required. If transcription “begin” and “end” coordinates are used instead of “start” and “stop” coordinates, no orientation information is required.
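The reasoning above can be made concrete with a short sketch. It shows why an “end” coordinate alone is ambiguous while an (end, orientation) pair or a (begin, end) pair is not. The function name location_ids, the dictionary keys, and the 'F'/'R' orientation encoding are all assumptions made for this example.

```python
def location_ids(start, stop, orientation):
    """Build three alternative location tuples for a feature.

    Assumes orientation is 'F' (forward) or 'R' (reverse).
    """
    begin, end = (start, stop) if orientation == "F" else (stop, start)
    return {
        "start_stop_orientation": (start, stop, orientation),
        "end_orientation": (end, orientation),
        "begin_end": (begin, end),
    }


# Two different features whose transcription ends at the same coordinate.
forward = location_ids(100, 400, "F")
reverse = location_ids(400, 700, "R")

# The bare end coordinate collides, but each tuple still tells them apart.
same_end = forward["end_orientation"][0] == reverse["end_orientation"][0]
distinct = forward["end_orientation"] != reverse["end_orientation"]
```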

set_locus_tag(tag='', delimiter='_', check_value=None)

Set locus tag and parse the locus_tag feature number.

Parameters
  • tag (str) – Input locus_tag data.

  • delimiter (str) – Value used to split locus_tag data.

  • check_value (str) – Indicates genome name or other value that will be used to parse the locus_tag to identify the feature number. If no check_value is provided, the genome_id attribute is used.
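Parsing a feature number out of a locus_tag might look like the sketch below. The parsing rules here are assumptions chosen for illustration (split on the delimiter, ignore the part matching check_value, take the last all-digit part); the function name parse_locus_tag_number is hypothetical and the actual method may apply different rules.

```python
def parse_locus_tag_number(tag, delimiter="_", check_value=None):
    """Return the feature number found in a locus_tag, or '' if none.

    Hypothetical helper for illustration only.
    """
    parts = tag.split(delimiter)
    if check_value is not None:
        # Ignore the part that names the genome (case-insensitive), so a
        # numeric genome name is not mistaken for the feature number.
        parts = [p for p in parts if p.upper() != check_value.upper()]
    # Take the last remaining part that consists only of digits.
    for part in reversed(parts):
        if part.isdigit():
            return part
    return ""


number = parse_locus_tag_number("SEA_TRIXIE_20", check_value="Trixie")
```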

set_name(value=None)

Set the feature name.

Ideally, the name of the CDS will be an integer. This information can be stored in multiple fields in the GenBank-formatted flat file. The name is derived from one of several qualifiers.

Parameters

value (str) – Indicates a value that should be used to directly set the name regardless of the ‘gene’ and ‘_locus_tag_num’ attributes.

set_nucleotide_length(seq=False, translation=False)

Set the length of the nucleotide sequence.

Nucleotide length can be computed several different ways, including from the difference of the start and stop coordinates, the length of the transcribed nucleotide sequence, or the length of the translation. For compound features, using either the nucleotide or translation sequence is the accurate way to determine the true length of the feature, but ‘length’ may mean different things in different contexts.

Parameters
  • seq (bool) – Use the nucleotide sequence from the ‘seq’ attribute to compute the length.

  • translation (bool) – Use the translation sequence from the ‘translation’ attribute to compute the length.
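The three ways of computing length can disagree for a compound feature, as the following sketch shows. It is not the pdm_utils implementation: the function name nucleotide_length is hypothetical, coordinates are assumed to be 0-based half-open, and the translation-based length assumes the stop codon is absent from the translation (hence the extra 3 nucleotides).

```python
def nucleotide_length(start=None, stop=None, seq=None, translation=None):
    """Compute a feature length from a sequence, translation, or coordinates.

    Hypothetical helper for illustration only.
    """
    if seq is not None:
        return len(seq)
    if translation is not None:
        # Three nucleotides per residue, plus an assumed stop codon.
        return len(translation) * 3 + 3
    if start is None or stop is None:
        raise ValueError("Provide coordinates, a sequence, or a translation.")
    # Assumes 0-based half-open coordinates.
    return stop - start


# A compound feature with parts [0, 6) and [9, 15): the outer span is
# 15 nucleotides, but only 12 nucleotides are part of the feature.
span_length = nucleotide_length(start=0, stop=15)
seq_length = nucleotide_length(seq="ATGAAA" + "CCCTAA")
translation_length = nucleotide_length(translation="MKP")
```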

set_nucleotide_sequence(value=None, parent_genome_seq=None)

Set the nucleotide sequence of the feature.

This method can directly set the attribute from a supplied ‘value’, or it can retrieve the sequence from the parent genome using Biopython. In this latter case, it relies on a Biopython SeqFeature object for the sequence extraction method and coordinates. If this object was generated from a Biopython-parsed GenBank-formatted flat file, the coordinates are by default ‘0-based half-open’, and the object contains coordinates for every part of the feature (e.g. if it is a compound feature) as well as any fuzzy locations. As a result, the length of the retrieved sequence may not exactly match the length indicated from the ‘start’ and ‘stop’ coordinates. If the nucleotide sequence ‘value’ is provided, the ‘parent_genome_seq’ does not impact the result.

Parameters
  • value (str or Seq) – Input nucleotide sequence.

  • parent_genome_seq (Seq) – Input parent genome nucleotide sequence.

set_num(attr, description, delimiter=None, prefix_set=None)

Set a number attribute from a description.

Parameters
  • attr (str) – Attribute to set the number.

  • description (str) – Description data from which to parse the number.

  • delimiter (str) – Value used to split the description data.

  • prefix_set (set) – Valid possible delimiters in the description.

set_orientation(value, format, case=False)

Sets orientation based on indicated format.

Relies on the reformat_strand function to manage orientation data.

Parameters
  • value (misc.) – Input orientation value.

  • format (str) – Indicates how the orientation data should be formatted.

  • case (bool) – Indicates whether the output orientation data should be cased.

set_seqfeature(type='CDS')

Set the ‘seqfeature’ attribute.

The ‘seqfeature’ attribute stores a Biopython SeqFeature object, which contains methods valuable to extracting sequence data relevant to the feature.

set_translation(value=None, translate=False)

Set translation and its length.

The translation is coerced into a Biopython Seq object. If no input translation value is provided, the translation is generated from the parent genome nucleotide sequence. If an input translation value is provided, the ‘translate’ parameter has no impact.

Parameters
  • value (str or Seq) – Amino acid sequence.

  • translate (bool) – Indicates whether the translation should be generated from the parent genome nucleotide sequence.

set_translation_table(value)

Set translation table integer.

Parameters

value (int) – Translation table that should be used to generate the translation.

translate_seq()

Translate the CDS nucleotide sequence.

Use Biopython to translate the nucleotide sequence. The method expects the nucleotide sequence to be a valid CDS sequence in which:

  1. it begins with a valid start codon,

  2. it ends with a stop codon,

  3. it contains only one stop codon,

  4. its length is divisible by 3.

Non-standard start codons are translated to methionine. If these criteria are not met, an empty Seq object is returned.

Returns

Amino acid sequence

Return type

Seq
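The structural criteria for a valid CDS sequence can be checked without performing a translation, as sketched below. This is a standalone illustration: the function name is_valid_cds is hypothetical, and the start codon set is an assumption limited to three common bacterial start codons (a given translation table may define more). Biopython's Seq.translate() accepts a cds=True argument that enforces comparable criteria during translation.

```python
# Assumed codon sets for this sketch; a translation table may define others.
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_valid_cds(seq):
    """Return True if `seq` meets the structural criteria of a CDS.

    Hypothetical helper for illustration only.
    """
    seq = seq.upper()
    # The length must be divisible by 3 and hold a start and a stop codon.
    if len(seq) < 6 or len(seq) % 3 != 0:
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    if codons[0] not in START_CODONS:
        return False
    if codons[-1] not in STOP_CODONS:
        return False
    # Only the final codon may be a stop codon.
    return not any(codon in STOP_CODONS for codon in codons[:-1])


valid = is_valid_cds("ATGAAACCCTAA")
internal_stop = is_valid_cds("ATGAAATAACCCTAA")
```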