phameration

Functions that are used in the phameration pipeline

pdm_utils.functions.phameration.blastp(index, chunk, tmp, db_path, evalue, query_cov)

Runs ‘blastp’ using the given chunk as the input gene set. The BLAST output is an adjacency matrix for this chunk. :param index: chunk index being run :type index: int :param chunk: the translations to run right now :type chunk: tuple of 2-tuples :param tmp: directory used for temporary file I/O :type tmp: str :param db_path: path to the target blast database :type db_path: str :param evalue: e-value cutoff to report hits :type evalue: float :param query_cov: query coverage cutoff to report hits
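The following is a minimal sketch of how such a blastp call could be assembled, not the library's actual implementation. The helper name blastp_command, the query_fasta and outfile arguments, the 3-column output format, and the mapping of query_cov onto the BLAST+ option -qcov_hsp_perc are all assumptions made for illustration.

```python
def blastp_command(query_fasta, db_path, evalue, query_cov, outfile):
    """Build (but do not run) a blastp command line as a list of arguments."""
    return [
        "blastp",
        "-query", query_fasta,             # FASTA file holding this chunk
        "-db", db_path,                    # database built by makeblastdb
        "-evalue", str(evalue),            # e-value cutoff
        "-qcov_hsp_perc", str(query_cov),  # assumed meaning of query_cov
        "-outfmt", "6 qseqid sseqid evalue",  # query, subject, e-value columns
        "-out", outfile,
    ]
```

The returned list can be passed to subprocess.run once BLAST+ is installed; keeping the command as a list avoids shell quoting problems.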

pdm_utils.functions.phameration.chunk_translations(translation_groups, chunksize=500)

Breaks translation_groups into a dictionary of chunks, where each chunk is a tuple of up to chunksize 2-tuples and each 2-tuple pairs a translation with its corresponding geneid. :param translation_groups: translations and their geneids :type translation_groups: dict :param chunksize: maximum number of translations per chunk :type chunksize: int :return: chunks :rtype: dict
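The chunking described above can be sketched as follows. This is an illustration only: the function name is hypothetical, and pairing each translation with the first geneid in its group is an assumption about how the "corresponding geneid" is chosen.

```python
def chunk_translations_sketch(translation_groups, chunksize=500):
    """Split {translation: [geneids]} into {index: tuple of 2-tuples}."""
    # Assumption: the first geneid of each group represents its translation.
    pairs = [(translation, geneids[0])
             for translation, geneids in translation_groups.items()]
    chunks = {}
    for index, start in enumerate(range(0, len(pairs), chunksize)):
        # The final chunk may hold fewer than chunksize pairs.
        chunks[index] = tuple(pairs[start:start + chunksize])
    return chunks
```

With three translations and chunksize=2, this yields one chunk of two pairs and a second chunk holding the remaining pair.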

pdm_utils.functions.phameration.create_blastdb(fasta, db_name, db_path)

Runs ‘makeblastdb’ to create a BLAST-searchable database. :param fasta: FASTA-formatted input file :type fasta: str :param db_name: BLAST sequence database :type db_name: str :param db_path: BLAST sequence database path :type db_path: str
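As a rough sketch, a makeblastdb call for a protein database could be assembled like this. The helper name is hypothetical, and the choice to pass db_name as the title and db_path as the output path is an assumption, not something the docstring states.

```python
def makeblastdb_command(fasta, db_name, db_path):
    """Build (but do not run) a makeblastdb command line."""
    return [
        "makeblastdb",
        "-in", fasta,        # FASTA-formatted input file
        "-dbtype", "prot",   # translations are protein sequences
        "-title", db_name,   # assumed use of db_name
        "-out", db_path,     # assumed use of db_path
    ]
```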

pdm_utils.functions.phameration.fix_colored_orphams(engine)

Find any single-member phams which are colored as though they are multi-member phams (not #FFFFFF in pham.Color). :param engine: sqlalchemy Engine allowing access to the database :return:

pdm_utils.functions.phameration.fix_white_phams(engine)

Find any phams with 2+ members which are colored as though they are orphams (#FFFFFF in pham.Color). :param engine: sqlalchemy Engine allowing access to the database :return:

pdm_utils.functions.phameration.get_geneids_and_translations(engine)

Constructs a dictionary mapping all geneids to their translations. :param engine: the Engine allowing access to the database :return: gs_to_ts

pdm_utils.functions.phameration.get_new_geneids(engine)

Queries the database for those genes that are not yet phamerated. :param engine: the Engine allowing access to the database :return: new_geneids

pdm_utils.functions.phameration.get_pham_colors(engine)

Queries the database for the colors of existing phams. :param engine: the Engine allowing access to the database :return: pham_colors

pdm_utils.functions.phameration.get_pham_geneids(engine)

Queries the database for those genes that are already phamerated. :param engine: the Engine allowing access to the database :return: pham_geneids

pdm_utils.functions.phameration.get_translation_groups(engine)

Constructs a dictionary mapping all unique translations to the groups of geneids that share them. :param engine: the Engine allowing access to the database :return: ts_to_gs
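The relationship between the two mappings (gs_to_ts and ts_to_gs) can be illustrated without a database. The sketch below inverts a geneid-to-translation dictionary in memory; the function name is hypothetical, and the real function queries the database through the engine instead.

```python
def group_by_translation(gs_to_ts):
    """Invert {geneid: translation} into {translation: [geneids]}."""
    ts_to_gs = {}
    for geneid, translation in gs_to_ts.items():
        # Genes with identical translations end up in the same list.
        ts_to_gs.setdefault(translation, []).append(geneid)
    return ts_to_gs
```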

pdm_utils.functions.phameration.markov_cluster(adj_mat_file, inflation, tmp_dir)

Run ‘mcl’ on an adjacency matrix to cluster the blastp results. :param adj_mat_file: 3-column file with blastp resultant queries, subjects, and evalues :type adj_mat_file: str :param inflation: mcl inflation parameter :type inflation: float :param tmp_dir: file I/O directory :type tmp_dir: str :return: outfile :rtype: str
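A possible way to assemble the mcl call is sketched below. The helper name and the output filename mcl_clusters.txt are hypothetical; the --abc, -I, and -o options are standard mcl options for label-format input, inflation, and output file respectively.

```python
import os


def mcl_command(adj_mat_file, inflation, tmp_dir):
    """Build (but do not run) an mcl command line and its output path."""
    # Hypothetical output filename inside the working directory.
    outfile = os.path.join(tmp_dir, "mcl_clusters.txt")
    command = [
        "mcl", adj_mat_file,
        "--abc",               # input is a 3-column label file
        "-I", str(inflation),  # inflation controls cluster granularity
        "-o", outfile,
    ]
    return command, outfile
```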

pdm_utils.functions.phameration.merge_pre_and_hmm_phams(hmm_phams, pre_phams, consensus_lookup)

Merges the pre-pham sequences (which contain all nr sequences) with the hmm phams (which contain only hmm consensus sequences) into the full hmm-based clustering output. Uses consensus_lookup dictionary to find the pre-pham that each consensus belongs to, and then adds each pre-pham geneid to a full pham based on the hmm phams. :param hmm_phams: clustered consensus sequences :type hmm_phams: dict :param pre_phams: clustered sequences (used to generate hmms) :type pre_phams: dict :param consensus_lookup: reverse-mapped pre_phams :type consensus_lookup: dict :return: phams :rtype: dict
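The merge described above can be sketched as follows. The data shapes are assumptions: hmm_phams maps a pham key to a list of consensus identifiers, consensus_lookup maps each consensus identifier to its pre-pham key, and pre_phams maps each pre-pham key to its geneids. The function name is hypothetical.

```python
def merge_phams_sketch(hmm_phams, pre_phams, consensus_lookup):
    """Expand each hmm pham into the geneids of its member pre-phams."""
    phams = {}
    for pham_key, consensus_ids in hmm_phams.items():
        merged = []
        for consensus_id in consensus_ids:
            # Find the pre-pham this consensus was built from...
            pre_key = consensus_lookup[consensus_id]
            # ...and add all of that pre-pham's geneids to the full pham.
            merged.extend(pre_phams[pre_key])
        phams[pham_key] = merged
    return phams
```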

pdm_utils.functions.phameration.mmseqs_clust(consensus_db, align_db, cluster_db)

Runs ‘mmseqs clust’ to cluster an MMseqs2 consensus database using an MMseqs2 alignment database, with results being saved to an MMseqs2 cluster database. :param consensus_db: MMseqs2 sequence database :type consensus_db: str :param align_db: MMseqs2 alignment database :type align_db: str :param cluster_db: MMseqs2 cluster database :type cluster_db: str

pdm_utils.functions.phameration.mmseqs_cluster(sequence_db, cluster_db, args)

Runs ‘mmseqs cluster’ to cluster an MMseqs2 sequence database. :param sequence_db: MMseqs2 sequence database :type sequence_db: str :param cluster_db: MMseqs2 clustered database :type cluster_db: str :param args: parsed command line arguments :type args: dict

pdm_utils.functions.phameration.mmseqs_createdb(fasta, sequence_db)

Runs ‘mmseqs createdb’ to convert a FASTA file into an MMseqs2 sequence database. :param fasta: path to the FASTA file to convert :type fasta: str :param sequence_db: MMseqs2 sequence database :type sequence_db: str

pdm_utils.functions.phameration.mmseqs_createseqfiledb(sequence_db, cluster_db, seqfile_db)

Runs ‘mmseqs createseqfiledb’ to create the intermediate to the FASTA-like parseable output. :param sequence_db: MMseqs2 sequence database :type sequence_db: str :param cluster_db: MMseqs2 clustered database :type cluster_db: str :param seqfile_db: MMseqs2 seqfile database :type seqfile_db: str

pdm_utils.functions.phameration.mmseqs_profile2consensus(profile_db, consensus_db)

Runs ‘mmseqs profile2consensus’ to extract consensus sequences from an MMseqs2 profile database, and creates an MMseqs2 sequence database from the consensuses. :param profile_db: MMseqs2 profile database :type profile_db: str :param consensus_db: MMseqs2 sequence database :type consensus_db: str

pdm_utils.functions.phameration.mmseqs_result2flat(query_db, target_db, seqfile_db, outfile)

Runs ‘mmseqs result2flat’ to create FASTA-like parseable output. :param query_db: MMseqs2 sequence or profile database :type query_db: str :param target_db: MMseqs2 sequence database :type target_db: str :param seqfile_db: MMseqs2 seqfile database :type seqfile_db: str :param outfile: FASTA-like parseable output :type outfile: str

pdm_utils.functions.phameration.mmseqs_result2profile(sequence_db, cluster_db, profile_db)

Runs ‘mmseqs result2profile’ to convert clusters from one MMseqs2 clustered database into a profile database. :param sequence_db: MMseqs2 sequence database :type sequence_db: str :param cluster_db: MMseqs2 clustered database :type cluster_db: str :param profile_db: MMseqs2 profile database :type profile_db: str

pdm_utils.functions.phameration.mmseqs_search(profile_db, consensus_db, align_db, args)

Runs ‘mmseqs search’ to search profiles against their consensus sequences and save the alignment results to an MMseqs2 alignment database. The profile_db and consensus_db MUST be the same size. :param profile_db: MMseqs2 profile database :type profile_db: str :param consensus_db: MMseqs2 sequence database :type consensus_db: str :param align_db: MMseqs2 alignment database :type align_db: str :param args: parsed command line arguments :type args: dict
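The MMseqs2 wrappers documented in this module appear to form a sequence-to-profile clustering workflow. The sketch below lists the underlying mmseqs commands in one plausible order; the ordering, the helper name, and the database file suffixes are assumptions rather than the pipeline's confirmed behavior, and no clustering options from args are shown.

```python
def mmseqs_pipeline_commands(fasta, prefix, tmp_dir, outfile):
    """Return a list of mmseqs command lines (not executed)."""
    # Hypothetical database names derived from a common prefix.
    seq_db = prefix + ".seqDB"
    clu_db = prefix + ".cluDB"
    profile_db = prefix + ".profileDB"
    consensus_db = prefix + ".consensusDB"
    align_db = prefix + ".alignDB"
    hmm_clu_db = prefix + ".hmmCluDB"
    seqfile_db = prefix + ".seqfileDB"
    return [
        ["mmseqs", "createdb", fasta, seq_db],
        ["mmseqs", "cluster", seq_db, clu_db, tmp_dir],
        ["mmseqs", "result2profile", seq_db, seq_db, clu_db, profile_db],
        ["mmseqs", "profile2consensus", profile_db, consensus_db],
        ["mmseqs", "search", profile_db, consensus_db, align_db, tmp_dir],
        ["mmseqs", "clust", consensus_db, align_db, hmm_clu_db],
        # Convert the first-pass clusters into FASTA-like parseable output.
        ["mmseqs", "createseqfiledb", seq_db, clu_db, seqfile_db],
        ["mmseqs", "result2flat", seq_db, seq_db, seqfile_db, outfile],
    ]
```

Each inner list can be handed to subprocess.run in turn; the cluster and search steps require a temporary directory as their final positional argument.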

pdm_utils.functions.phameration.parse_mcl_output(outfile)

Parses the mci output into phams. :param outfile: mci output file :type outfile: str :return: phams :rtype: dict
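A minimal parsing sketch follows, assuming the cluster file holds one cluster per line with whitespace-separated geneids (as mcl writes when run in label mode) and that phams are numbered from 1. It takes an iterable of lines rather than a file path so it can be tried without a file; both the function name and the numbering scheme are hypothetical.

```python
def parse_clusters_sketch(lines):
    """Turn cluster lines into {pham number: set of geneids}."""
    phams = {}
    for line in lines:
        members = line.split()
        if not members:
            continue  # skip blank lines
        # Number phams consecutively in the order they appear.
        phams[len(phams) + 1] = set(members)
    return phams
```

To read from disk, an open file handle can be passed directly, since iterating over it yields lines.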

pdm_utils.functions.phameration.parse_mmseqs_output(outfile)

Parses the indicated MMseqs2 FASTA-like file into a dictionary of integer-named phams. :param outfile: FASTA-like parseable output :type outfile: str :return: phams :rtype: dict

pdm_utils.functions.phameration.preserve_phams(old_phams, new_phams, old_colors, new_genes)

Attempts to keep pham numbers consistent from one round of pham building to the next. :param old_phams: the dictionary that maps old phams to their genes :param new_phams: the dictionary that maps new phams to their genes :param old_colors: the dictionary that maps old phams to colors :param new_genes: the set of previously unphamerated genes :return:
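The docstring does not spell out the matching rule, so the sketch below shows one simple strategy only: a new pham inherits an old pham's number when, after setting aside the previously unphamerated genes, its gene set is identical to the old one. The function name and the rule itself are assumptions, and color handling (old_colors) is left out.

```python
def preserve_ids_sketch(old_phams, new_phams, new_genes):
    """Reuse old pham numbers where gene sets match exactly."""
    # Index old phams by their gene sets for exact-match lookup.
    old_by_genes = {frozenset(genes): pham_id
                    for pham_id, genes in old_phams.items()}
    next_id = max(old_phams, default=0) + 1
    final = {}
    for genes in new_phams.values():
        carried = frozenset(genes) - set(new_genes)
        pham_id = old_by_genes.get(carried)
        if pham_id is None or pham_id in final:
            # No unambiguous match: assign a number not used before.
            pham_id = next_id
            next_id += 1
        final[pham_id] = set(genes)
    return final
```

Under this rule a pham that merely gained new genes keeps its number, while a pham whose old members changed receives a fresh one.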

pdm_utils.functions.phameration.reintroduce_duplicates(new_phams, trans_groups, genes_and_trans)

Reintroduces into each pham ALL GeneIDs that map onto the set of translations in the pham. :param new_phams: the pham dictionary for which duplicates are to be reintroduced :param trans_groups: the dictionary that maps translations to the GeneIDs that share them :param genes_and_trans: the dictionary that maps GeneIDs to their translations :return:
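The reintroduction step can be sketched as below, using the two lookup dictionaries named in the docstring. The function name is hypothetical, and returning a new dictionary of sets (rather than modifying new_phams in place) is an assumption.

```python
def reintroduce_duplicates_sketch(new_phams, trans_groups, genes_and_trans):
    """Expand each pham to every geneid sharing one of its translations."""
    full_phams = {}
    for pham_id, geneids in new_phams.items():
        members = set()
        for geneid in geneids:
            translation = genes_and_trans[geneid]
            # Add every gene that shares this translation, duplicates included.
            members.update(trans_groups[translation])
        full_phams[pham_id] = members
    return full_phams
```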

pdm_utils.functions.phameration.update_gene_table(phams, engine)

Updates the gene table with new pham data. :param phams: new pham gene data :type phams: dict :param engine: sqlalchemy Engine allowing access to the database :return:

pdm_utils.functions.phameration.update_pham_table(colors, engine)

Populates the pham table with the new PhamIDs and their colors. :param colors: new pham color data :type colors: dict :param engine: sqlalchemy Engine allowing access to the database :return:

pdm_utils.functions.phameration.write_fasta(translation_groups, outfile)

Writes a FASTA file of the non-redundant protein sequences to be assorted into phamilies. :param translation_groups: groups of genes that share a translation :type translation_groups: dict :param outfile: FASTA filename :type outfile: str :return:
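A minimal sketch of writing the non-redundant FASTA is shown below. It writes to any file-like handle so it can be tried in memory; the function name is hypothetical, and using the first geneid of each group as the FASTA header is an assumption.

```python
import io


def write_fasta_sketch(translation_groups, handle):
    """Write one FASTA record per unique translation."""
    for translation, geneids in translation_groups.items():
        # Assumption: the first geneid in the group labels the record.
        handle.write(f">{geneids[0]}\n{translation}\n")


# Usage: write two unique translations to an in-memory buffer.
buffer = io.StringIO()
write_fasta_sketch({"MKT": ["g1", "g2"], "MAA": ["g3"]}, buffer)
```

Passing a handle from open(outfile, "w") instead of the buffer writes the same records to disk.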