flat_files
Functions to interact with, use, and parse genomic data from GenBank-formatted flat files.
- pdm_utils.functions.flat_files.cds_to_seqrecord(cds, parent_genome, gene_domains=[], desc_type='gb')
Creates a SeqRecord object from a Cds and its parent Genome.
- Parameters
cds (Cds) – A populated Cds object.
parent_genome (Genome) – Populated parent Genome object of the Cds object.
gene_domains (list) – List of domain objects populated with column attributes.
desc_type (str) – Intended format of the CDS SeqRecord description.
- Returns
Filled Biopython SeqRecord object.
- Return type
SeqRecord
- pdm_utils.functions.flat_files.create_fasta_seqrecord(header, sequence_string)
Create a fasta-formatted Biopython SeqRecord object.
- Parameters
header (str) – Description of the sequence.
sequence_string (str) – Nucleotide sequence.
- Returns
Biopython SeqRecord containing the nucleotide sequence.
- Return type
SeqRecord
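To make the roles of `header` and `sequence_string` concrete, the sketch below builds the text of a FASTA record in plain Python. It is only an illustration of the FASTA layout, not the pdm_utils or Biopython implementation; `format_fasta`, its `width` parameter, and the example header are hypothetical.

```python
def format_fasta(header, sequence_string, width=60):
    """Return FASTA text: a '>' header line, then the sequence in fixed-width lines."""
    lines = [">" + header]
    # Split the sequence into chunks of at most `width` characters.
    for i in range(0, len(sequence_string), width):
        lines.append(sequence_string[i:i + width])
    return "\n".join(lines) + "\n"


# Hypothetical header and a short sequence, wrapped at 4 characters per line.
fasta_text = format_fasta("example_phage", "ATGCATGCAT", width=4)
print(fasta_text)
```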
- pdm_utils.functions.flat_files.create_seqfeature_dictionary(seqfeature_list)
Create a dictionary of Biopython SeqFeature objects based on their type.
From a list of all Biopython SeqFeatures derived from a GenBank-formatted flat file, create a dictionary of SeqFeatures based on their ‘type’ attribute.
- Parameters
seqfeature_list (list) – List of Biopython SeqFeatures
- Returns
A dictionary of Biopython SeqFeatures. Key: SeqFeature type (source, tRNA, CDS, other). Value: SeqFeature.
- Return type
dict
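The grouping described above can be sketched in plain Python. This is a minimal illustration, not the pdm_utils implementation: `Feature` is a hypothetical stand-in for a Biopython SeqFeature (only its `type` attribute is modeled), and the sketch assumes each dictionary value collects all features of that type in a list.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for a Biopython SeqFeature; only `type` matters here.
Feature = namedtuple("Feature", ["type", "name"])


def group_features_by_type(features):
    """Group feature objects by `type` (assumption: one list per type)."""
    grouped = defaultdict(list)
    for feature in features:
        grouped[feature.type].append(feature)
    return dict(grouped)


features = [
    Feature("source", "genome"),
    Feature("CDS", "gene_1"),
    Feature("tRNA", "trna_1"),
    Feature("CDS", "gene_2"),
]
by_type = group_features_by_type(features)
print(sorted(by_type))       # the feature types that were found
print(len(by_type["CDS"]))   # how many CDS features were grouped
```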
- pdm_utils.functions.flat_files.format_cds_seqrecord_CDS_feature(cds_feature, cds, parent_genome)
- pdm_utils.functions.flat_files.genome_to_seqrecord(phage_genome)
Creates a SeqRecord object from a pdm_utils Genome object.
- Parameters
phage_genome (Genome) – A pdm_utils Genome object.
- Returns
A Biopython SeqRecord object.
- Return type
SeqRecord
- pdm_utils.functions.flat_files.get_cds_seqrecord_annotations(cds, parent_genome)
Function that creates a Cds SeqRecord annotations attribute dict.
- Parameters
cds (Cds) – A populated Cds object.
parent_genome (Genome) – Populated parent Genome object of the Cds object.
- Returns
Formatted SeqRecord annotations dictionary.
- Return type
dict{str}
- pdm_utils.functions.flat_files.get_cds_seqrecord_annotations_comments(cds)
Function that creates a Cds SeqRecord comments attribute tuple.
- Parameters
cds (Cds) – A populated Cds object.
- pdm_utils.functions.flat_files.get_cds_seqrecord_regions(gene_domains, cds)
- pdm_utils.functions.flat_files.get_genome_seqrecord_annotations(phage_genome)
Helper function that uses Genome data to populate the annotations SeqRecord attribute
- Parameters
phage_genome (Genome) – A pdm_utils Genome object.
- Returns
annotations (dict), a dictionary formatted as Biopython’s SeqRecord annotations attribute.
- pdm_utils.functions.flat_files.get_genome_seqrecord_annotations_comments(phage_genome)
Helper function that uses Genome data to populate the comment annotation attribute
- Parameters
phage_genome (Genome) – A pdm_utils Genome object.
- Returns
A tuple (cluster_comment, auto_generated_comment, annotation_status_comment, qc_and_retrieval values) formatted as Biopython’s SeqRecord annotations comment attribute.
- pdm_utils.functions.flat_files.get_genome_seqrecord_description(phage_genome)
Helper function to construct a description SeqRecord attribute.
- Parameters
phage_genome (Genome) – A pdm_utils Genome object.
- Returns
description (str), a formatted string built from the genome data.
- pdm_utils.functions.flat_files.get_genome_seqrecord_features(phage_genome)
Helper function that uses Genome data to populate the features SeqRecord attribute.
- Parameters
phage_genome (Genome) – A pdm_utils Genome object.
- Returns
features (list), a list of SeqFeature objects built from the genome’s Cds objects.
- pdm_utils.functions.flat_files.parse_cds_seqfeature(seqfeature)
Parse data from a Biopython CDS SeqFeature object into a Cds object.
- Parameters
seqfeature (SeqFeature) – Biopython SeqFeature
- Returns
A pdm_utils Cds object
- Return type
Cds
- pdm_utils.functions.flat_files.parse_coordinates(seqfeature)
Parse the boundary coordinates from a GenBank-formatted flat file.
The function takes a Biopython SeqFeature object containing data that was parsed from the feature in the flat file. Parsing these coordinates can be tricky. There can be more than one set of coordinates if it is a compound location. Only features with 1 or 2 open reading frames (parts) are correctly parsed. Also, the boundaries may not be precise; instead they may be open or fuzzy. Non-precise coordinates are converted to ‘-1’. If the strand is undefined, the coordinates are converted to ‘-1’ and parts is set to ‘0’. If an incorrect data type is provided, coordinates are set to ‘-1’ and parts is set to ‘0’.
- Parameters
seqfeature (SeqFeature) – Biopython SeqFeature
- Returns
tuple (start, stop, parts), where start (int) is the first coordinate, regardless of strand; stop (int) is the second coordinate, regardless of strand; and parts (int) is the number of open reading frames that define the feature.
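The rules described for parse_coordinates can be sketched with plain Python stand-ins. This is not the pdm_utils implementation: `Part` and `Feature` are hypothetical replacements for Biopython location and SeqFeature objects, a single `exact` flag models whether a part's boundaries are precise, and the sketch assumes that features with more than two parts keep their part count but receive ‘-1’ coordinates. Strand-specific ordering and wrap-around handling are not modeled.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Part:
    """Hypothetical stand-in for one location part of a feature."""
    start: int
    end: int
    exact: bool = True  # False models an open or fuzzy boundary


@dataclass
class Feature:
    """Hypothetical stand-in for a SeqFeature: a strand plus ordered parts."""
    strand: Optional[int]
    parts: List[Part]


def sketch_parse_coordinates(feature) -> Tuple[int, int, int]:
    # Incorrect data type: coordinates are -1 and parts is 0.
    if not isinstance(feature, Feature):
        return (-1, -1, 0)
    # Undefined strand: coordinates are -1 and parts is 0.
    if feature.strand is None:
        return (-1, -1, 0)
    parts = len(feature.parts)
    # Assumption: only 1 or 2 parts are handled; otherwise keep the count
    # and report -1 coordinates.
    if parts not in (1, 2):
        return (-1, -1, parts)
    first, last = feature.parts[0], feature.parts[-1]
    # Non-precise boundaries are converted to -1.
    start = first.start if first.exact else -1
    stop = last.end if last.exact else -1
    return (start, stop, parts)


simple = sketch_parse_coordinates(Feature(1, [Part(10, 100)]))
compound = sketch_parse_coordinates(Feature(1, [Part(10, 100), Part(150, 300)]))
no_strand = sketch_parse_coordinates(Feature(None, [Part(10, 100)]))
fuzzy = sketch_parse_coordinates(Feature(-1, [Part(10, 100, exact=False)]))
wrong_type = sketch_parse_coordinates("not a feature")
print(simple, compound, no_strand, fuzzy, wrong_type)
```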
- pdm_utils.functions.flat_files.parse_genome_data(seqrecord, filepath=PosixPath('.'), translation_table=11, genome_id_field='_organism_name', gnm_type='', host_genus_field='_organism_host_genus')
Parse data from a Biopython SeqRecord object into a Genome object.
All Source, CDS, tRNA, and tmRNA features are parsed into their associated Source, Cds, Trna, and Tmrna objects.
- Parameters
seqrecord (SeqRecord) – A Biopython SeqRecord object.
filepath (Path) – A filename associated with the returned Genome object.
translation_table (int) – The applicable translation table for the genome’s CDS features.
genome_id_field (str) – The SeqRecord attribute in which the unique genome identifier/name is stored.
host_genus_field (str) – The SeqRecord attribute in which the unique host genus identifier/name is stored.
gnm_type (str) – Identifier for the type of genome.
- Returns
A pdm_utils Genome object.
- Return type
Genome
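A typical call sequence, using only the signatures documented on this page, might look like the sketch below. `load_genome` is a hypothetical helper name. The import is deferred into the function body so that defining the helper does not require pdm_utils; calling it assumes pdm_utils and Biopython are installed.

```python
from pathlib import Path


def load_genome(filepath, translation_table=11):
    """Hypothetical helper: read one flat file and parse it into a Genome.

    Returns None when retrieve_genome_data() returns None.
    """
    # Deferred import: assumes pdm_utils is installed when this is called.
    from pdm_utils.functions import flat_files

    filepath = Path(filepath)
    seqrecord = flat_files.retrieve_genome_data(filepath)
    if seqrecord is None:
        return None
    return flat_files.parse_genome_data(
        seqrecord,
        filepath=filepath,
        translation_table=translation_table,
    )
```

The remaining keyword arguments (genome_id_field, gnm_type, host_genus_field) are left at their documented defaults here.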
- pdm_utils.functions.flat_files.parse_source_seqfeature(seqfeature)
Parses a Biopython Source SeqFeature.
- Parameters
seqfeature (SeqFeature) – Biopython SeqFeature
- Returns
A pdm_utils Source object
- Return type
Source
- pdm_utils.functions.flat_files.parse_tmrna_seqfeature(seqfeature)
Parses data from a Biopython tmRNA SeqFeature object into a Tmrna object.
- Parameters
seqfeature (SeqFeature) – Biopython SeqFeature
- Returns
A pdm_utils Tmrna object
- Return type
Tmrna
- pdm_utils.functions.flat_files.parse_trna_seqfeature(seqfeature)
Parse data from a Biopython tRNA SeqFeature object into a Trna object.
- Parameters
seqfeature (SeqFeature) – Biopython SeqFeature
- Returns
A pdm_utils Trna object
- Return type
Trna
- pdm_utils.functions.flat_files.retrieve_genome_data(filepath)
Retrieve data from a GenBank-formatted flat file.
- Parameters
filepath (Path) – Path to GenBank-formatted flat file that will be parsed using Biopython.
- Returns
If there is only one record, a Biopython SeqRecord of parsed data. If the file cannot be parsed, or if there are multiple records, None is returned.
- Return type
SeqRecord
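The "exactly one record, otherwise None" contract can be illustrated without Biopython. In the sketch below, `single_record_or_none` and `fake_parse` are hypothetical names, `parse` stands in for any callable that yields records and may raise on malformed input, and an empty result is assumed to count as unparseable. It is not the pdm_utils implementation.

```python
def single_record_or_none(parse, filepath):
    """Return the only record produced by parse(filepath), else None."""
    try:
        records = list(parse(filepath))
    except Exception:
        # Unparseable input: mirror the documented behavior and return None.
        return None
    # Zero records (assumed unparseable) or multiple records both give None.
    if len(records) != 1:
        return None
    return records[0]


def fake_parse(path):
    """Hypothetical parser used only to exercise the sketch."""
    data = {"one.gb": ["rec_a"], "two.gb": ["rec_a", "rec_b"]}
    if path not in data:
        raise ValueError("cannot parse " + path)
    return iter(data[path])


one = single_record_or_none(fake_parse, "one.gb")
two = single_record_or_none(fake_parse, "two.gb")
bad = single_record_or_none(fake_parse, "bad.gb")
print(one, two, bad)
```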
- pdm_utils.functions.flat_files.sort_seqrecord_features(seqrecord)
Function that sorts and processes the SeqFeature objects of a SeqRecord.
- Parameters
seqrecord (SeqRecord) – Biopython SeqRecord object for a phage genome.