serialize: py: try to track segments of the source #5778
Conversation
dvc/utils/serialize/_py.py
Outdated
@@ -23,7 +23,8 @@ def parse_py(text, path):
     with reraise(SyntaxError, PythonFileCorruptedError(path)):
         tree = ast.parse(text, filename=path)

     result = _ast_tree_to_dict(tree)
     lines = text.splitlines()
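For context, the extra bookkeeping essentially means slicing a node's text out of the split lines using its location info. A minimal sketch of that idea (the _segment helper below is hypothetical, not the code added in this PR, and it relies on the end_lineno/end_col_offset attributes that only exist on Python 3.8+, which is part of the discussion below):

import ast

def _segment(lines, node):
    # Hypothetical helper: slice a node's text out of pre-split source lines
    # using its location info (the end_* attributes require Python 3.8+).
    start, end = node.lineno - 1, node.end_lineno - 1
    if start == end:
        return lines[start][node.col_offset:node.end_col_offset]
    parts = [lines[start][node.col_offset:]]
    parts.extend(lines[start + 1:end])
    parts.append(lines[end][:node.end_col_offset])
    return "\n".join(parts)

text = "params = {\n    'lr': 1e-8,\n}"
node = ast.parse(text).body[0].value  # the dict literal
print(_segment(text.splitlines(), node))  # prints the dict exactly as written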
@isidentical, would love to hear your thoughts on this module and how we can improve it. Not really a priority for now, though.
It would've been much simpler if we could use end_lineno/end_col_offset (just like you did here), though they are only present in 3.8+. Instead of manually managing lines/location slices, there is even ast.get_source_segment.
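For reference, a quick sketch of those 3.8+ conveniences on the value from this PR's example (not code from the PR itself):

import ast

source = "temp = 1e-8"
node = ast.parse(source).body[0].value          # the 1e-8 constant
print(node.lineno, node.col_offset)             # 1 7
print(node.end_lineno, node.end_col_offset)     # 1 11  (3.8+ only)
print(ast.get_source_segment(source, node))     # 1e-8  (3.8+ only)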
There are a few options (depending on how complex a system we want to go with, but from what I see this is just a convenience module, so I don't think there is much need for anything extreme). It might be easiest to just record all changes (a list of change objects, [Change(replacement='new_val', lineno=3, col_offset=20), ...]) and then apply them one by one over the tokens (tokenize.generate_tokens + tokenize.untokenize). I might write a draft PR for that, after I get through how this works.
@skshetry do we support replacing containers (like A = [1,2,3,4], -S a=[5,6,7,8], or dicts etc.) or only literals (like strings / numbers etc.)?
do we support replacing containers
It seems we do, but only some of the time; otherwise it fails.
It would have been simpler if we didn't support containers (actually anything represented by more than one token), though I assume we could still support a selected set of literals (by keeping a stack of brackets/parens and selecting tokens until those brackets are balanced).
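As a rough illustration of that bracket-balancing idea (a sketch only; select_value_tokens is a hypothetical helper, not part of DVC, and it assumes the value's start position is already known):

import io
import tokenize

OPENERS = {"(": ")", "[": "]", "{": "}"}

def select_value_tokens(source, lineno, col_offset):
    # Collect the tokens that make up a value starting at (lineno, col_offset),
    # consuming until all opened brackets/parens/braces are balanced again.
    tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    start = next(
        i for i, tok in enumerate(tokens) if tok.start == (lineno, col_offset)
    )
    depth = 0
    selected = []
    for tok in tokens[start:]:
        selected.append(tok)
        if tok.type == tokenize.OP and tok.string in OPENERS:
            depth += 1
        elif tok.type == tokenize.OP and tok.string in OPENERS.values():
            depth -= 1
        if depth == 0:  # single tokens and balanced containers both end here
            break
    return selected

toks = select_value_tokens("A = [1, 2, [3, 4]]\n", 1, 4)
print("".join(tok.string for tok in toks))  # [1,2,[3,4]]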
Here is a simple snippet that would replace a segment of code with another given only the known start location (for single tokens; one exception we might introduce later would be complex numbers, 1 + 0j etc.). If we want to support containers, it would require a bit more logic, but it's still doable. (I didn't really test the code, FYI)
import ast
import io
import tokenize
from collections import defaultdict
from dataclasses import dataclass
from itertools import chain


@dataclass
class Change:
    value: str
    lineno: int
    col_offset: int

    @classmethod
    def from_node(cls, node, value):
        return cls(value, node.lineno, node.col_offset)


def shift_tokens(tokens, start, offset):
    # Shift the column positions of every token from `start` onwards so the
    # row stays internally consistent after a replacement changes its length.
    for index, token in enumerate(tokens[start:]):
        start_line, start_col, end_line, end_col = *token.start, *token.end
        tokens[start + index] = token._replace(
            start=(start_line, start_col + offset),
            end=(end_line, end_col + offset),
        )


def apply_changes(source, changes):
    # Group tokens by the line they start on.
    buffer = io.StringIO(source)
    token_groups = defaultdict(list)
    for token in tokenize.generate_tokens(buffer.readline):
        token_groups[token.start[0]].append(token)

    # Track how much each line has grown or shrunk so far.
    offset_map = defaultdict(int)
    for change in changes:
        token_group = token_groups[change.lineno]
        for index, token in enumerate(token_group):
            if token.start[1] == change.col_offset + offset_map[change.lineno]:
                break
        else:
            raise ValueError('corrupted ...')

        offset = len(change.value) - len(token.string)
        token_group[index] = token._replace(
            string=change.value,
            end=(token.end[0], token.start[1] + len(change.value)),
        )
        shift_tokens(token_group, index + 1, offset)
        offset_map[change.lineno] += offset

    tokens = chain.from_iterable(token_groups.values())
    return tokenize.untokenize(tokens)


def find_assign_value(tree, target_name):
    # Return the value node of the first assignment to `target_name`.
    for node in ast.walk(tree):
        if isinstance(node, ast.Assign):
            assert len(node.targets) == 1
            [target] = node.targets
        elif isinstance(node, ast.AnnAssign):
            target = node.target
        else:
            continue

        if isinstance(target, ast.Name) and target.id == target_name:
            return node.value


source = """\
# imagine this is a file
temp = 1e-8
"""
tree = ast.parse(source)
node = find_assign_value(tree, 'temp')
change = Change.from_node(node, '0.0001')
print(apply_changes(source, [change]))
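For reference, the expected output of the example, assuming the offset bookkeeping above holds up:

# imagine this is a file
temp = 0.0001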
Fix #5777
We work internally with the parsed object. So, as an example, if we have 1e-8 as a value in the code, we get it as 1e-08, which we then try to replace later, but the actual text in the code is 1e-8, due to which it fails.

Looking at the code, I'd really love to see an AST-based solution (parse into ast -> modify ast -> dump ast) rather than the current solution, where we:

1. parse the text with ast.parse,
2. convert the tree into a dict,
3. modify the dict, and
4. find and replace the old value's text with the new one in the source (with an ast.parse sanity check).

It fails on step 4, as the old value we get was parsed to 1e-08, which is what we try to replace, instead of the actual 1e-8. With this PR, we also keep track of the code segments of the values in step 2 and then use them later when modifying in step 4. It is not perfect, but it seems to work nicely (as before). It falls back to the original behavior if it does not find a segment.
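A small illustration of the mismatch described above (a sketch with Python 3.8+ node layout, not DVC code):

import ast

source = "temp = 1e-8"
node = ast.parse(source).body[0].value

# The parsed value no longer remembers how the literal was spelled:
print(node.value)                  # 1e-08
print(str(node.value) in source)   # False -> a naive find-and-replace fails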
❗ I have followed the Contributing to DVC checklist.
📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.
Thank you for the contribution - we'll try to review it as soon as possible. 🙏
TODO