
serialize: py: try to track segments of the source #5778

Closed
wants to merge 5 commits

Conversation

@skshetry (Member) commented Apr 7, 2021

Fix #5777

We work internally with the parsed object. So, as an example, if the code contains 1e-8 as a value, we get it back as 1e-08, which is what we later try to replace; but the actual text in the code is 1e-8, so the replacement fails.
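A minimal repro of the mismatch (not code from this PR; assumes Python 3.8+, where literals parse to ast.Constant):

import ast

source = "epsilon = 1e-8"
node = ast.parse(source).body[0].value  # the Constant node for 1e-8

# repr() of the parsed float does not match the original spelling,
# so a text replacement based on the repr cannot find it in the source.
print(repr(node.value))            # 1e-08
print(repr(node.value) in source)  # False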

Looking at the code, I'd really love to see an AST-based solution (parse into AST -> modify AST -> dump AST) rather than the current one, where we:

  1. Parse into AST.
  2. Convert the AST to a dictionary, also adding some items of interest such as the original source, lineno, etc.
  3. Modify the dict.
  4. Modify the source based on the new dictionary (by replacing the old value with the new one).
  5. Dump the source (with an ast.parse sanity check).

It fails at step 4: the old value we get was parsed into 1e-08, and that is what we try to replace, instead of the actual 1e-8 in the source. With this PR, we also keep track of the code segment of each value in step 2 and then use it when modifying the source in step 4 (see the sketch below). It is not perfect, but it seems to work as well as before, and it falls back to the original behavior if it does not find a segment.
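Roughly, the segment tracking could look like this (a hypothetical sketch, not the PR's actual code; get_segment and replace_value are made-up names, and end_lineno/end_col_offset require Python 3.8+):

def get_segment(lines, node):
    # Step 2: record the exact source text of a single-line node
    # from its location attributes.
    if node.lineno != node.end_lineno:
        return None  # multi-line node: no segment, use the fallback
    return lines[node.lineno - 1][node.col_offset : node.end_col_offset]

def replace_value(source, old_value, new_text, segment=None):
    # Step 4: prefer the tracked segment ("1e-8"); fall back to the
    # parsed value's repr ("1e-08") when no segment was recorded.
    old_text = segment if segment is not None else repr(old_value)
    return source.replace(old_text, new_text, 1)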


TODO

  • Error out if a value that needs to be modified could not be replaced successfully.
  • Support composite/nested items.
  • Tests

@skshetry self-assigned this Apr 7, 2021
@skshetry added the bugfix (fixes bug) label Apr 7, 2021
@@ -23,7 +23,8 @@ def parse_py(text, path):
     with reraise(SyntaxError, PythonFileCorruptedError(path)):
         tree = ast.parse(text, filename=path)
 
     result = _ast_tree_to_dict(tree)
+    lines = text.splitlines()
@skshetry (Member, Author) commented:

@isidentical, would love to hear your thoughts on this module and how we can improve it. Not really a priority for now, though.

@isidentical (Contributor) commented:

It would've been much simpler if we could use end_lineno/end_col_offset (just like you did here), though they are only present in 3.8+ (instead of manually managing lines/location slices, there is even ast.get_source_segment).
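For reference, a minimal illustration of ast.get_source_segment (3.8+ only):

import ast

source = "temp = 1e-8"
tree = ast.parse(source)
node = tree.body[0].value
# Returns the exact original text "1e-8", not the parsed repr "1e-08".
print(ast.get_source_segment(source, node))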

There are a few options (depending on how complex a system we want, though from what I see this is just a convenience module, so I don't think there is much need for anything extreme), but it might be easiest to just record all changes (a list of change objects, [Change(replacement='new_val', lineno=3, col_offset=20), ...]) and then apply them one by one over the tokens (tokenize.generate_tokens + tokenize.untokenize). I might write a draft PR for that after I work through how this code works.

@isidentical (Contributor) commented:

@skshetry, do we support replacing containers (like A = [1,2,3,4] with -S a=[5,6,7,8], dicts, etc.) or only literals (strings, numbers, etc.)?

@skshetry (Member, Author) commented:

> do we support replacing containers

It seems we do, but only some of the time; otherwise it fails.

@isidentical (Contributor) commented Apr 8, 2021:

It would have been simpler if we didn't support containers (actually, anything represented as multiple tokens), though I assume we could still support a select number of container literals (by keeping a stack of brackets/parens and selecting tokens until those brackets are balanced).
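The bracket-balancing idea could look roughly like this (an untested sketch; take_balanced is a made-up name, and tokens is a list produced by tokenize.generate_tokens):

def take_balanced(tokens, start):
    # Collect tokens from `start` until every bracket opened within
    # the selection is closed again; a bracket-free single token is
    # returned immediately, since depth never leaves zero.
    depth = 0
    selected = []
    for token in tokens[start:]:
        selected.append(token)
        if token.string in "([{":
            depth += 1
        elif token.string in ")]}":
            depth -= 1
        if depth == 0:
            break
    return selected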

Here is a simple snippet that replaces a segment of code with another given only the known start location (for single tokens; one exception we might handle later would be complex numbers, 1 + 0j etc.). If we want to support containers, that would require a bit more logic, but it is still doable. (I didn't really test the code, FYI.)

import ast
import io
import tokenize
from collections import defaultdict
from dataclasses import dataclass
from itertools import chain

@dataclass
class Change:
    value: str
    lineno: int
    col_offset: int

    @classmethod
    def from_node(cls, node, value):
        return cls(value, node.lineno, node.col_offset)

def shift_tokens(tokens, start, offset):
    # Shift the columns of every token after the replaced one so that
    # untokenize() preserves the original spacing on that line.
    for index, token in enumerate(tokens[start:]):
        start_line, start_col, end_line, end_col = *token.start, *token.end
        tokens[start + index] = token._replace(
            start=(start_line, start_col + offset),
            end=(end_line, end_col + offset)
        )

def apply_changes(source, changes):
    # Group tokens by the line they start on, so a change can be
    # located by its (lineno, col_offset).
    buffer = io.StringIO(source)
    token_groups = defaultdict(list)
    for token in tokenize.generate_tokens(buffer.readline):
        token_groups[token.start[0]].append(token)

    # Tracks how much each line has grown or shrunk from earlier changes.
    offset_map = defaultdict(int)
    for change in changes:
        token_group = token_groups[change.lineno]
        for index, token in enumerate(token_group):
            if token.start[1] == change.col_offset + offset_map[change.lineno]:
                break
        else:
            raise ValueError('corrupted ...')

        # Column delta between the replacement text and the old token.
        offset = len(change.value) - (token.end[1] - token.start[1])
        token_group[index] = token._replace(
            string=change.value,
            end=(token.end[0], token.start[1] + len(change.value))
        )
        shift_tokens(token_group, index + 1, offset)
        offset_map[change.lineno] += offset

    tokens = chain.from_iterable(token_groups.values())
    return tokenize.untokenize(tokens)

def find_assign_value(tree, target_name):
    for node in ast.walk(tree):
        if isinstance(node, ast.Assign):
            assert len(node.targets) == 1
            [target] = node.targets
        elif isinstance(node, ast.AnnAssign):
            target = node.target
        else:
            continue
        
        if isinstance(target, ast.Name) and target.id == target_name:
            return node.value

source = """\
# imagine this is a file
temp = 1e-8
"""

tree = ast.parse(source)
node = find_assign_value(tree, 'temp')
change = Change.from_node(node, '0.0001')
print(apply_changes(source, [change]))
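For the example above, apply_changes should print the source unchanged except for the replacement, i.e. temp = 0.0001 in place of temp = 1e-8.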

@skshetry marked this pull request as draft on April 7, 2021 11:19
Labels: bugfix (fixes bug)
Issue this PR may close: exp run --set-param fails to update param value
2 participants