Iupac #32
…and the other one doesn't
Code looks great! The tests seem good and the code makes sense, so I looked for ways to help with speed, though I'm not sure that's even necessary. Do you have a way of benchmarking the runtime?
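One way to answer the benchmarking question is `timeit` from the standard library. The sketch below is not the project's code: `IUPAC_MAP` is an assumed, partial stand-in for `IUPAC_map_dict`, and `expand` and `benchmark` are hypothetical names that only mirror the divide-and-conquer idea.

```python
import timeit

# Assumed, partial stand-in for the repo's IUPAC_map_dict (real contents not shown in this PR page).
IUPAC_MAP = {"R": ["A", "G"], "Y": ["C", "T"], "N": ["A", "C", "G", "T"]}


def expand(motif):
    """Hypothetical simplified expansion of a motif containing IUPAC codes."""
    if len(motif) < 1:
        return []
    if len(motif) == 1:
        return IUPAC_MAP.get(motif, [motif])
    mid = len(motif) // 2
    first_part = expand(motif[:mid])
    second_part = expand(motif[mid:])
    return [a + b for a in first_part for b in second_part]


def benchmark(func, motif, repeat=5, number=200):
    """Return the best observed time per call, in seconds."""
    times = timeit.repeat(lambda: func(motif), repeat=repeat, number=number)
    return min(times) / number


if __name__ == "__main__":
    for motif in ["RGG", "NNRY", "NNNNNN"]:
        per_call = benchmark(expand, motif)
        print(f"{motif}: {len(expand(motif))} sequences, {per_call:.2e} s/call")
```

Taking the minimum over several repeats reduces noise from other processes. Running the same harness before and after a change gives a rough before/after comparison.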
You can delete the setup.py file now that you have a pyproject.toml.
def GetAllPossibleSequences(motif):
    r"""
    Computing all the possible sequences that a motif can be considering IUPAC characters.
    For example, a motif with sequence RGG can be both AGG and GGG. Divide and conquer method is
    used to form all possible sequences.

    Parameters
    ----------
    motif : str
          motif

    Returns
    -------
    All possible sequences : list of str
    """
    possible_seqs = []
    if len(motif) < 1:
        return []
    elif len(motif) == 1:
        if motif[0] in IUPAC_map_dict:
            return IUPAC_map_dict[motif[0]]
        else:
            return motif[0]
    else:
        first_part = GetAllPossibleSequences(motif[0:int(len(motif)/2)])
        second_part = GetAllPossibleSequences(motif[int(len(motif)/2):])
        for seq1 in first_part:
            for seq2 in second_part:
                possible_seqs.append(seq1 + seq2)
        return possible_seqs
could you memoize this function to get a speedup? or would that require too much memory?
@lru_cache(maxsize=None)
def GetAllPossibleSequences(motif):
    r"""
    Computing all the possible sequences that a motif can be considering IUPAC characters.
    For example, a motif with sequence RGG can be both AGG and GGG. Divide and conquer method is
    used to form all possible sequences.

    Parameters
    ----------
    motif : str
          motif

    Returns
    -------
    All possible sequences : set of str
    """
    # Keep the empty-motif guard: without it, splitting "" recurses forever.
    if len(motif) < 1:
        return set()
    if len(motif) == 1:
        # Wrap in set() so every branch returns the documented type.
        return set(IUPAC_map_dict.get(motif, [motif]))
    first_part = GetAllPossibleSequences(motif[:len(motif)//2])
    second_part = GetAllPossibleSequences(motif[len(motif)//2:])
    return {f1 + f2 for f1 in first_part for f2 in second_part}
you might need to add this at the top:
from functools import lru_cache
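On the memory question, `cache_info()` shows how many entries a memoized function is holding, and a finite `maxsize` caps that number. The sketch below rests on assumptions: `IUPAC_MAP` is a partial stand-in for `IUPAC_map_dict`, `expand_cached` is a hypothetical name, and `maxsize=1024` is an arbitrary bound. It returns tuples so that callers cannot mutate cached results.

```python
from functools import lru_cache

# Assumed, partial stand-in for the repo's IUPAC_map_dict.
IUPAC_MAP = {"R": ("A", "G"), "Y": ("C", "T"), "N": ("A", "C", "G", "T")}


@lru_cache(maxsize=1024)  # arbitrary bound; least recently used entries are evicted beyond it
def expand_cached(motif):
    """Hypothetical memoized expansion; returns an immutable tuple of sequences."""
    if not motif:
        return ()
    if len(motif) == 1:
        return tuple(IUPAC_MAP.get(motif, (motif,)))
    mid = len(motif) // 2
    first_part = expand_cached(motif[:mid])
    second_part = expand_cached(motif[mid:])
    return tuple(a + b for a in first_part for b in second_part)


if __name__ == "__main__":
    print(expand_cached("RGGRGG"))
    # Reports hits, misses, maxsize and the current number of cached entries.
    print(expand_cached.cache_info())
```

The cache stores one entry per distinct sub-motif, and each entry holds that sub-motif's full expansion, so memory grows with both the number of distinct motifs seen and the number of ambiguous characters in each.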
potential_sequences_1 = [utils.GetCanonicalMotif(motif) for motif in GetAllPossibleSequences(motif_1)]
potential_sequences_2 = [utils.GetCanonicalMotif(motif) for motif in GetAllPossibleSequences(motif_2)]
it might be slightly faster to use sets instead of lists?
potential_sequences_1 = {utils.GetCanonicalMotif(motif) for motif in GetAllPossibleSequences(motif_1)}
potential_sequences_2 = {utils.GetCanonicalMotif(motif) for motif in GetAllPossibleSequences(motif_2)}
Adding support for IUPAC nucleotide codes to EnsembleTR.