flat_files

Functions to interact with, use, and parse genomic data from GenBank-formatted flat files.

pdm_utils.functions.flat_files.cds_to_seqrecord(cds, parent_genome, gene_domains=[], desc_type='gb')

Creates a SeqRecord object from a Cds and its parent Genome.

Parameters
  • cds (Cds) – A populated Cds object.

  • parent_genome (Genome) – Populated parent Genome object of the Cds object.

  • gene_domains (list) – List of domain objects populated with column attributes.

  • desc_type (str) – Intended format of the CDS SeqRecord description.

Returns

Filled Biopython SeqRecord object.

Return type

SeqRecord

pdm_utils.functions.flat_files.create_fasta_seqrecord(header, sequence_string)

Create a fasta-formatted Biopython SeqRecord object.

Parameters
  • header (str) – Description of the sequence.

  • sequence_string (str) – Nucleotide sequence.

Returns

Biopython SeqRecord containing the nucleotide sequence.

Return type

SeqRecord

pdm_utils.functions.flat_files.create_seqfeature_dictionary(seqfeature_list)

Create a dictionary of Biopython SeqFeature objects based on their type.

From a list of all Biopython SeqFeatures derived from a GenBank-formatted flat file, create a dictionary of SeqFeatures based on their ‘type’ attribute.

Parameters
  • seqfeature_list (list) – List of Biopython SeqFeatures

  • genome_id (str) – An identifier for the genome in which the seqfeature is defined.

Returns

A dictionary of Biopython SeqFeatures. Key: SeqFeature type (source, tRNA, CDS, other). Value: SeqFeature.

Return type

dict
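The grouping described for create_seqfeature_dictionary can be sketched with the standard library alone. This is an illustration of the idea, not the pdm_utils implementation: `group_by_type` is a hypothetical helper, the `SimpleNamespace` objects stand in for Biopython SeqFeatures (only their `type` attribute is used), and storing a list of features under each key is an assumption.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_type(features):
    """Group feature-like objects by their 'type' attribute (hypothetical helper)."""
    grouped = defaultdict(list)
    for feature in features:
        grouped[feature.type].append(feature)
    # Return a plain dict so that missing keys raise KeyError as usual.
    return dict(grouped)


# Stand-ins for Biopython SeqFeatures; only 'type' matters for this sketch.
features = [
    SimpleNamespace(type="source", label="src"),
    SimpleNamespace(type="CDS", label="gene_1"),
    SimpleNamespace(type="tRNA", label="trna_1"),
    SimpleNamespace(type="CDS", label="gene_2"),
]

grouped = group_by_type(features)
cds_labels = [f.label for f in grouped["CDS"]]
```

Features that share a type keep their original relative order within each group, which is convenient when CDS features are later processed in genome order.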

pdm_utils.functions.flat_files.format_cds_seqrecord_CDS_feature(cds_feature, cds, parent_genome)
pdm_utils.functions.flat_files.genome_to_seqrecord(phage_genome)

Creates a SeqRecord object from a pdm_utils Genome object.

Parameters

phage_genome (Genome) – A pdm_utils Genome object.

Returns

A Biopython SeqRecord object.

Return type

SeqRecord

pdm_utils.functions.flat_files.get_cds_seqrecord_annotations(cds, parent_genome)

Creates the annotations attribute dictionary for a Cds SeqRecord.

Parameters
  • cds (Cds) – A populated Cds object.

  • parent_genome (Genome) – Populated parent Genome object of the Cds object.

Returns

Formatted SeqRecord annotations dictionary.

Return type

dict{str}

pdm_utils.functions.flat_files.get_cds_seqrecord_annotations_comments(cds)

Function that creates a Cds SeqRecord comments attribute tuple.

Parameters

cds (Cds) – A populated Cds object.

pdm_utils.functions.flat_files.get_cds_seqrecord_regions(gene_domains, cds)
pdm_utils.functions.flat_files.get_genome_seqrecord_annotations(phage_genome)

Helper function that uses Genome data to populate the annotations SeqRecord attribute.

Parameters

phage_genome (Genome) – A pdm_utils Genome object.

Returns

A dictionary formatted as a Biopython SeqRecord annotations attribute.

Return type

dict

pdm_utils.functions.flat_files.get_genome_seqrecord_annotations_comments(phage_genome)

Helper function that uses Genome data to populate the comment annotation attribute.

Parameters

phage_genome (Genome) – A pdm_utils Genome object.

Returns

A tuple (cluster_comment, auto_generated_comment, annotation_status_comment, qc_and_retrieval values) formatted as a Biopython SeqRecord annotations comment attribute.

Return type

tuple

pdm_utils.functions.flat_files.get_genome_seqrecord_description(phage_genome)

Helper function to construct a description SeqRecord attribute.

Parameters

phage_genome (Genome) – A pdm_utils Genome object.

Returns

A formatted description string parsed from Genome data.

Return type

str

pdm_utils.functions.flat_files.get_genome_seqrecord_features(phage_genome)

Helper function that uses Genome data to populate the features SeqRecord attribute.

Parameters

phage_genome (Genome) – A pdm_utils Genome object.

Returns

A list of SeqFeature objects parsed from Cds objects.

Return type

list

pdm_utils.functions.flat_files.parse_cds_seqfeature(seqfeature)

Parse data from a Biopython CDS SeqFeature object into a Cds object.

Parameters
  • seqfeature (SeqFeature) – Biopython SeqFeature

  • genome_id (str) – An identifier for the genome in which the seqfeature is defined.

Returns

A pdm_utils Cds object

Return type

Cds

pdm_utils.functions.flat_files.parse_coordinates(seqfeature)

Parse the boundary coordinates from a GenBank-formatted flat file.

The function takes a Biopython SeqFeature object containing data that was parsed from the feature in the flat file. Parsing these coordinates can be tricky. There can be more than one set of coordinates if it is a compound location. Only features with 1 or 2 open reading frames (parts) are correctly parsed. Also, the boundaries may not be precise; instead they may be open or fuzzy. Non-precise coordinates are converted to ‘-1’. If the strand is undefined, the coordinates are converted to ‘-1’ and parts is set to ‘0’. If an incorrect data type is provided, coordinates are set to ‘-1’ and parts is set to ‘0’.

Parameters

seqfeature (SeqFeature) – Biopython SeqFeature

Returns

A tuple (start, stop, parts), where start (int) is the first coordinate, regardless of strand; stop (int) is the second coordinate, regardless of strand; and parts (int) is the number of open reading frames that define the feature.
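The fallback rules documented for parse_coordinates can be made concrete with a small stand-in model. The sketch below does not use Biopython and is not the pdm_utils implementation: `Part`, `Feature`, and `sketch_parse_coordinates` are hypothetical names, a single `exact` flag per part replaces Biopython's position classes, the handling of features with more than two parts is an assumption, and strand-specific ordering of compound parts is ignored.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Part:
    """Stand-in for one segment of a feature location (hypothetical)."""
    start: int
    end: int
    exact: bool = True  # False models an open or fuzzy boundary


@dataclass
class Feature:
    """Stand-in for a SeqFeature: a strand plus one or more parts (hypothetical)."""
    strand: Optional[int]
    parts: List[Part]


def sketch_parse_coordinates(feature) -> Tuple[int, int, int]:
    """Apply the documented fallback rules to a stand-in feature."""
    # Incorrect data type: coordinates are -1 and parts is 0.
    if not isinstance(feature, Feature):
        return (-1, -1, 0)
    # Undefined strand: coordinates are -1 and parts is 0.
    if feature.strand is None:
        return (-1, -1, 0)
    count = len(feature.parts)
    # Assumption: anything other than 1 or 2 parts yields -1 coordinates.
    if count not in (1, 2):
        return (-1, -1, count)
    first, last = feature.parts[0], feature.parts[-1]
    # Assumption: a non-precise part turns the coordinate it supplies into -1.
    start = first.start if first.exact else -1
    stop = last.end if last.exact else -1
    return (start, stop, count)


simple = sketch_parse_coordinates(Feature(1, [Part(10, 50)]))
fuzzy = sketch_parse_coordinates(Feature(1, [Part(10, 50, exact=False)]))
no_strand = sketch_parse_coordinates(Feature(None, [Part(10, 50)]))
wrong_type = sketch_parse_coordinates("not a feature")
compound = sketch_parse_coordinates(Feature(1, [Part(900, 1000), Part(0, 40)]))
```

Because every failure mode maps to the sentinel value -1, a caller can validate a result by checking for negative coordinates or a parts value of 0 rather than catching exceptions.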

pdm_utils.functions.flat_files.parse_genome_data(seqrecord, filepath=PosixPath('.'), translation_table=11, genome_id_field='_organism_name', gnm_type='', host_genus_field='_organism_host_genus')

Parse data from a Biopython SeqRecord object into a Genome object.

All Source, CDS, tRNA, and tmRNA features are parsed into their associated Source, Cds, Trna, and Tmrna objects.

Parameters
  • seqrecord (SeqRecord) – A Biopython SeqRecord object.

  • filepath (Path) – A filename associated with the returned Genome object.

  • translation_table (int) – The applicable translation table for the genome’s CDS features.

  • genome_id_field (str) – The SeqRecord attribute in which the unique genome identifier/name is stored.

  • host_genus_field (str) – The SeqRecord attribute in which the unique host genus identifier/name is stored.

  • gnm_type (str) – Identifier for the type of genome.

Returns

A pdm_utils Genome object.

Return type

Genome

pdm_utils.functions.flat_files.parse_source_seqfeature(seqfeature)

Parses a Biopython Source SeqFeature.

Parameters
  • seqfeature (SeqFeature) – Biopython SeqFeature

  • genome_id (str) – An identifier for the genome in which the seqfeature is defined.

Returns

A pdm_utils Source object

Return type

Source

pdm_utils.functions.flat_files.parse_tmrna_seqfeature(seqfeature)

Parses data from a Biopython tmRNA SeqFeature object into a Tmrna object.

Parameters

seqfeature (SeqFeature) – Biopython SeqFeature

Returns

A pdm_utils Tmrna object

Return type

Tmrna

pdm_utils.functions.flat_files.parse_trna_seqfeature(seqfeature)

Parse data from a Biopython tRNA SeqFeature object into a Trna object.

Parameters

seqfeature (SeqFeature) – Biopython SeqFeature

Returns

A pdm_utils Trna object

Return type

Trna

pdm_utils.functions.flat_files.retrieve_genome_data(filepath)

Retrieve data from a GenBank-formatted flat file.

Parameters

filepath (Path) – Path to GenBank-formatted flat file that will be parsed using Biopython.

Returns

If there is only one record, a Biopython SeqRecord of parsed data. If the file cannot be parsed, or if there are multiple records, None is returned.

Return type

SeqRecord
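The return contract of retrieve_genome_data (exactly one record, otherwise None) can be sketched independently of Biopython. The helper below is hypothetical and is not the pdm_utils implementation: `only_record_or_none` and `read_records` are illustrative names, plain strings stand in for SeqRecords, and the sketch assumes that a parsing failure surfaces as a ValueError.

```python
def only_record_or_none(read_records):
    """Return the single record produced by read_records(), else None.

    read_records is a hypothetical zero-argument callable that returns an
    iterable of parsed records and may raise ValueError on malformed input.
    """
    try:
        records = list(read_records())
    except ValueError:
        # Assumption: a parsing failure surfaces as ValueError.
        return None
    # Zero records or several records are both treated as "no usable record".
    return records[0] if len(records) == 1 else None


def broken_reader():
    """Simulate a flat file that cannot be parsed."""
    raise ValueError("malformed flat file")


one = only_record_or_none(lambda: ["record_A"])
many = only_record_or_none(lambda: ["record_A", "record_B"])
empty = only_record_or_none(lambda: [])
failed = only_record_or_none(broken_reader)
```

Since None covers both the unparseable case and the multi-record case, code that calls retrieve_genome_data should check the result for None before passing it to parse_genome_data.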

pdm_utils.functions.flat_files.sort_seqrecord_features(seqrecord)

Function that sorts and processes the seqfeature objects of a seqrecord.

Parameters

seqrecord (SeqRecord) – Phage genome Biopython seqrecord object