Iupac #32

heliziii · 2024-11-27T20:36:31Z

Adding support for IUPAC nucleotide codes to EnsembleTR.

…and the other one doesn't

aryarm

Code looks great! Since the tests look good and the code makes sense, I tried to see if I could help with speed, but I'm not sure that's even necessary. Do you have a way of benchmarking the runtime?
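One way to answer the benchmarking question is a small harness built on the standard library's `timeit`. This is only a sketch: `time_expansion` and its parameters are hypothetical names, not part of EnsembleTR, and the motifs passed in would need to be representative of real inputs.

```python
import timeit


def time_expansion(func, motifs, repeat=5, number=100):
    """Return the best observed time per call, in seconds.

    func   : callable taking one motif string (e.g. GetAllPossibleSequences)
    motifs : list of motif strings to expand on every pass
    repeat : how many timing rounds to run; the minimum is reported
    number : how many passes over `motifs` per round
    """
    def run():
        for motif in motifs:
            func(motif)

    # timeit.repeat accepts a callable and returns one total time per round.
    best_round = min(timeit.repeat(run, repeat=repeat, number=number))
    return best_round / (number * len(motifs))


# Hypothetical usage, assuming the function under review is importable:
# per_call = time_expansion(GetAllPossibleSequences, ["RGG", "NNNN", "AAAG"])
```

If the function being timed is memoized, repeated calls on the same motifs mostly measure cache lookups, so the cache would need to be cleared between rounds to time the expansion itself.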

aryarm · 2024-11-27T23:22:42Z

setup.py

You can delete the setup.py file now that you have a pyproject.toml

aryarm · 2024-11-27T23:41:40Z

ensembletr/utils.py

+def GetAllPossibleSequences(motif):
+    r"""
+    Computing all the possible sequences that a motif can be considering IUPAC characters.
+    For example, a motif with sequence RGG can be both AGG and GGG. Divide and conquer method is
+    used to form all possible sequences.
+
+    Parameters
+    ----------
+    motif : str
+       motif
+
+    Returns
+    -------
+    All possible sequences : list of str
+    """
+    possible_seqs = []
+    if len(motif) < 1:
+        return []
+    elif len(motif) == 1:
+        if motif[0] in IUPAC_map_dict:
+            return IUPAC_map_dict[motif[0]]
+        else:
+            return motif[0]
+    else:
+        first_part = GetAllPossibleSequences(motif[0:int(len(motif)/2)])
+        second_part = GetAllPossibleSequences(motif[int(len(motif)/2):])
+        for seq1 in first_part:
+            for seq2 in second_part:
+                possible_seqs.append(seq1 + seq2)
+    return possible_seqs


Could you memoize this function to get a speedup? Or would that require too much memory?

Suggested change

-def GetAllPossibleSequences(motif):
-    r"""
-    Computing all the possible sequences that a motif can be considering IUPAC characters.
-    For example, a motif with sequence RGG can be both AGG and GGG. Divide and conquer method is
-    used to form all possible sequences.
-
-    Parameters
-    ----------
-    motif : str
-       motif
-
-    Returns
-    -------
-    All possible sequences : list of str
-    """
-    possible_seqs = []
-    if len(motif) < 1:
-        return []
-    elif len(motif) == 1:
-        if motif[0] in IUPAC_map_dict:
-            return IUPAC_map_dict[motif[0]]
-        else:
-            return motif[0]
-    else:
-        first_part = GetAllPossibleSequences(motif[0:int(len(motif)/2)])
-        second_part = GetAllPossibleSequences(motif[int(len(motif)/2):])
-        for seq1 in first_part:
-            for seq2 in second_part:
-                possible_seqs.append(seq1 + seq2)
-    return possible_seqs
+@lru_cache(maxsize=None)
+def GetAllPossibleSequences(motif):
+    r"""
+    Computing all the possible sequences that a motif can be considering IUPAC characters.
+    For example, a motif with sequence RGG can be both AGG and GGG. Divide and conquer method is
+    used to form all possible sequences.
+
+    Parameters
+    ----------
+    motif : str
+       motif
+
+    Returns
+    -------
+    All possible sequences : set of str
+    """
+    if len(motif) == 1:
+        return IUPAC_map_dict.get(motif, [motif])
+    first_part = GetAllPossibleSequences(motif[:len(motif)//2])
+    second_part = GetAllPossibleSequences(motif[len(motif)//2:])
+    return {f1 + f2 for f1 in first_part for f2 in second_part}

You might need to add this at the top:

from functools import lru_cache
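A self-contained sketch of the memoized, divide-and-conquer expansion discussed in this thread. It is not the merged implementation: `IUPAC_MAP` and `expand_motif` are hypothetical names, and the table is the standard IUPAC nucleotide ambiguity set, which may differ from the project's `IUPAC_map_dict`. The sketch keeps the original function's guard for an empty motif and returns a `frozenset` so cached results cannot be mutated by callers.

```python
from functools import lru_cache

# Assumed table of IUPAC ambiguity codes (hypothetical stand-in for IUPAC_map_dict).
IUPAC_MAP = {
    "R": "AG", "Y": "CT", "S": "GC", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


@lru_cache(maxsize=None)
def expand_motif(motif):
    """Return a frozenset of every concrete sequence an IUPAC motif can represent."""
    if len(motif) == 0:
        # Without this guard, slicing an empty string would recurse forever.
        return frozenset()
    if len(motif) == 1:
        # An unambiguous base maps to itself.
        return frozenset(IUPAC_MAP.get(motif, motif))
    mid = len(motif) // 2
    left = expand_motif(motif[:mid])
    right = expand_motif(motif[mid:])
    # Combine every expansion of the left half with every expansion of the right half.
    return frozenset(a + b for a in left for b in right)
```

On the memory question: with `maxsize=None` the cache is unbounded, and a motif made of k `N` characters expands to 4^k sequences, so a bounded `maxsize` may be the safer choice if long, highly ambiguous motifs are expected.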

aryarm · 2024-11-27T23:53:02Z

ensembletr/utils.py

+    potential_sequences_1 = [utils.GetCanonicalMotif(motif) for motif in GetAllPossibleSequences(motif_1)]
+    potential_sequences_2 = [utils.GetCanonicalMotif(motif) for motif in GetAllPossibleSequences(motif_2)]


It might be slightly faster to use sets instead of lists?

Suggested change

-    potential_sequences_1 = [utils.GetCanonicalMotif(motif) for motif in GetAllPossibleSequences(motif_1)]
-    potential_sequences_2 = [utils.GetCanonicalMotif(motif) for motif in GetAllPossibleSequences(motif_2)]
+    potential_sequences_1 = {utils.GetCanonicalMotif(motif) for motif in GetAllPossibleSequences(motif_1)}
+    potential_sequences_2 = {utils.GetCanonicalMotif(motif) for motif in GetAllPossibleSequences(motif_2)}
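To illustrate why sets can help here, the sketch below assumes the two collections are later compared to find motifs they have in common, which the diff hunk does not show. All names are hypothetical: `expand` is a simple product-based stand-in for `GetAllPossibleSequences`, and `canonical` picks the lexicographically smallest rotation as a simplified stand-in for `utils.GetCanonicalMotif`, not TRTools' actual implementation.

```python
from itertools import product

# Assumed table of IUPAC ambiguity codes (hypothetical stand-in for IUPAC_map_dict).
IUPAC_MAP = {
    "R": "AG", "Y": "CT", "S": "GC", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand(motif):
    """All concrete sequences for an IUPAC motif (stand-in for GetAllPossibleSequences)."""
    choices = (IUPAC_MAP.get(base, base) for base in motif)
    return {"".join(combo) for combo in product(*choices)}


def canonical(seq):
    """Smallest rotation of seq (simplified stand-in for GetCanonicalMotif)."""
    if not seq:
        return seq
    return min(seq[i:] + seq[:i] for i in range(len(seq)))


def shared_motifs(motif_1, motif_2):
    """Canonical motifs that both IUPAC motifs can represent."""
    potential_1 = {canonical(s) for s in expand(motif_1)}
    potential_2 = {canonical(s) for s in expand(motif_2)}
    # Set intersection uses hashing, and the sets drop duplicate canonical forms.
    return potential_1 & potential_2
```

With lists, checking each element of one collection against the other is a nested scan; with sets, duplicates produced by canonicalization collapse and the overlap is a single `&` operation. For the short motifs typical of tandem repeats the difference is likely small.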

Helyaneh Ziaei-jam and others added 2 commits May 13, 2024 10:08

updating setup.py

d3fdd0f

adding support for IUPAC

186ae47

heliziii requested a review from aryarm November 27, 2024 20:36

aryarm mentioned this pull request Nov 27, 2024

fix: allow for IUPAC codes in GetCanonicalMotif() gymrek-lab/TRTools#239

Closed

Helyaneh Ziaei-jam added 2 commits November 27, 2024 13:00

adding comments

d509749

adding mutual motif for when one of the motifs has a IUPAC character …

fe509de

…and the other one doesn't

aryarm approved these changes Nov 27, 2024

aryarm linked an issue Dec 1, 2024 that may be closed by this pull request

IUPAC handling in ExpansionHunter gymrek-lab/TRTools#238

Closed

heliziii merged commit 1a3a8e6 into main Dec 6, 2024

aryarm mentioned this pull request Dec 6, 2024

IUPAC handling in ExpansionHunter gymrek-lab/TRTools#238

Closed
