serialize: py: try to track segments of the source #5778
Conversation
dvc/utils/serialize/_py.py
Outdated
@@ -23,7 +23,8 @@ def parse_py(text, path):
     with reraise(SyntaxError, PythonFileCorruptedError(path)):
         tree = ast.parse(text, filename=path)

     result = _ast_tree_to_dict(tree)
     lines = text.splitlines()
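For context, the extra bookkeeping essentially means slicing a node's text out of the split lines using its location info. A minimal sketch of that idea (the _segment helper below is hypothetical, not the code added in this PR, and it relies on the end_lineno/end_col_offset attributes that only exist on Python 3.8+, which is part of the discussion below):

import ast

def _segment(lines, node):
    # Hypothetical helper: slice a node's text out of pre-split source lines
    # using its location info (the end_* attributes require Python 3.8+).
    start, end = node.lineno - 1, node.end_lineno - 1
    if start == end:
        return lines[start][node.col_offset:node.end_col_offset]
    parts = [lines[start][node.col_offset:]]
    parts.extend(lines[start + 1:end])
    parts.append(lines[end][:node.end_col_offset])
    return "\n".join(parts)

text = "params = {\n    'lr': 1e-8,\n}"
node = ast.parse(text).body[0].value  # the dict literal
print(_segment(text.splitlines(), node))  # prints the dict exactly as written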
@isidentical, would love to hear your thoughts on this module and how we can improve it. Not really a priority for now, though.
It would've been much simpler if we could use end_lineno/end_col_offset (just like you did here), though they are only present in 3.8+. Instead of manually managing lines/location slices, there is even ast.get_source_segment.
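For reference, a quick sketch of those 3.8+ conveniences on the value from this PR's example (not code from the PR itself):

import ast

source = "temp = 1e-8"
node = ast.parse(source).body[0].value          # the 1e-8 constant
print(node.lineno, node.col_offset)             # 1 7
print(node.end_lineno, node.end_col_offset)     # 1 11  (3.8+ only)
print(ast.get_source_segment(source, node))     # 1e-8  (3.8+ only)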
There are a few options (depending on how complex a system we want to go with, but from what I see this is just a convenience module, so I don't think there is much need for anything extreme). It might be easiest to just record all changes (a list of change objects, [Change(replacement='new_val', lineno=3, col_offset=20), ...]) and then apply them one by one over the tokens (tokenize.generate_tokens + tokenize.untokenize). I might write a draft PR for that, after I get through how this works.
@skshetry do we support replacing containers (like A = [1,2,3,4], -S a=[5,6,7,8], or dicts etc.) or only literals (like strings / numbers etc.)?
do we support replacing containers
It seems we do, but only some of the time; otherwise it fails.
It would have been simpler if we didn't support containers (actually anything represented by more than one token), though I assume we could still support a selected set of literals (by keeping a stack of brackets/parens and selecting tokens until those brackets are balanced).
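As a rough illustration of that bracket-balancing idea (a sketch only; select_value_tokens is a hypothetical helper, not part of DVC, and it assumes the value's start position is already known):

import io
import tokenize

OPENERS = {"(": ")", "[": "]", "{": "}"}

def select_value_tokens(source, lineno, col_offset):
    # Collect the tokens that make up a value starting at (lineno, col_offset),
    # consuming until all opened brackets/parens/braces are balanced again.
    tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    start = next(
        i for i, tok in enumerate(tokens) if tok.start == (lineno, col_offset)
    )
    depth = 0
    selected = []
    for tok in tokens[start:]:
        selected.append(tok)
        if tok.type == tokenize.OP and tok.string in OPENERS:
            depth += 1
        elif tok.type == tokenize.OP and tok.string in OPENERS.values():
            depth -= 1
        if depth == 0:  # single tokens and balanced containers both end here
            break
    return selected

toks = select_value_tokens("A = [1, 2, [3, 4]]\n", 1, 4)
print("".join(tok.string for tok in toks))  # [1,2,[3,4]]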
Here is a simple snippet that would replace a segment of code with another given only the known start location (for single tokens; one exception we might introduce later would be complex numbers, 1 + 0j etc.). If we want to support containers, it would require a bit more logic, but it's still doable. (I didn't really test the code, FYI)
import ast
import io
import tokenize
from collections import defaultdict
from dataclasses import dataclass
from itertools import chain


@dataclass
class Change:
    value: str
    lineno: int
    col_offset: int

    @classmethod
    def from_node(cls, node, value):
        return cls(value, node.lineno, node.col_offset)


def shift_tokens(tokens, start, offset):
    # Shift the column positions of every token from `start` onwards so the
    # row stays internally consistent after a replacement changes its length.
    for index, token in enumerate(tokens[start:]):
        start_line, start_col, end_line, end_col = *token.start, *token.end
        tokens[start + index] = token._replace(
            start=(start_line, start_col + offset),
            end=(end_line, end_col + offset),
        )


def apply_changes(source, changes):
    # Group tokens by the line they start on.
    buffer = io.StringIO(source)
    token_groups = defaultdict(list)
    for token in tokenize.generate_tokens(buffer.readline):
        token_groups[token.start[0]].append(token)

    # Track how much each line has grown or shrunk so far.
    offset_map = defaultdict(int)
    for change in changes:
        token_group = token_groups[change.lineno]
        for index, token in enumerate(token_group):
            if token.start[1] == change.col_offset + offset_map[change.lineno]:
                break
        else:
            raise ValueError('corrupted ...')

        offset = len(change.value) - len(token.string)
        token_group[index] = token._replace(
            string=change.value,
            end=(token.end[0], token.start[1] + len(change.value)),
        )
        shift_tokens(token_group, index + 1, offset)
        offset_map[change.lineno] += offset

    tokens = chain.from_iterable(token_groups.values())
    return tokenize.untokenize(tokens)


def find_assign_value(tree, target_name):
    # Return the value node of the first assignment to `target_name`.
    for node in ast.walk(tree):
        if isinstance(node, ast.Assign):
            assert len(node.targets) == 1
            [target] = node.targets
        elif isinstance(node, ast.AnnAssign):
            target = node.target
        else:
            continue

        if isinstance(target, ast.Name) and target.id == target_name:
            return node.value


source = """\
# imagine this is a file
temp = 1e-8
"""
tree = ast.parse(source)
node = find_assign_value(tree, 'temp')
change = Change.from_node(node, '0.0001')
print(apply_changes(source, [change]))
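For reference, the expected output of the example, assuming the offset bookkeeping above holds up:

# imagine this is a file
temp = 0.0001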
Fix #5777
We work internally with the parsed object. So, as an example, if we have 1e-8 as a value in the code, we get it as 1e-08, which we then try to replace later, but the actual text in the code is 1e-8, due to which it fails.

Looking at the code, I'd really love to see an AST-based solution (parse into ast -> modify ast -> dump ast) rather than the current solution, where we:

1. parse the text with ast.parse,
2. convert the tree into a dict,
3. modify the dict, and
4. find and replace the old value's text with the new one in the source (with an ast.parse sanity check).

It fails on step 4, as the old value we get was parsed to 1e-08, which is what we try to replace, instead of the actual 1e-8. With this PR, we also keep track of the code segments of the values in step 2 and then use them later when modifying in step 4. It is not perfect, but it seems to work nicely (as before). It falls back to the original behavior if it does not find a segment.
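A small illustration of the mismatch described above (a sketch with Python 3.8+ node layout, not DVC code):

import ast

source = "temp = 1e-8"
node = ast.parse(source).body[0].value

# The parsed value no longer remembers how the literal was spelled:
print(node.value)                  # 1e-08
print(str(node.value) in source)   # False -> a naive find-and-replace fails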
❗ I have followed the Contributing to DVC checklist.
📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.
Thank you for the contribution - we'll try to review it as soon as possible. 🙏
TODO