Tokenize generate_tokens regression in CPython 3.12 #111224

Erotemic · 2023-10-23T18:31:52Z

Bug report

Bug description:

I've noticed a regression when adding 3.12 support to xdoctest.

The following MWE has different behavior on 3.11 and 3.12.

import tokenize
lines = ['3, 4]', 'print(len(x))']
iterable = (line for line in lines if line)


def _readline():
    return next(iterable)

for t in tokenize.generate_tokens(_readline):
    print(t)

On 3.11 and earlier versions this will result in a tokenize.TokenError being raised:

TokenInfo(type=2 (NUMBER), string='3', start=(1, 0), end=(1, 1), line='3, 4]')
TokenInfo(type=54 (OP), string=',', start=(1, 1), end=(1, 2), line='3, 4]')
TokenInfo(type=2 (NUMBER), string='4', start=(1, 3), end=(1, 4), line='3, 4]')
TokenInfo(type=54 (OP), string=']', start=(1, 4), end=(1, 5), line='3, 4]')
TokenInfo(type=1 (NAME), string='print', start=(2, 0), end=(2, 5), line='print(len(x))')
TokenInfo(type=54 (OP), string='(', start=(2, 5), end=(2, 6), line='print(len(x))')
TokenInfo(type=1 (NAME), string='len', start=(2, 6), end=(2, 9), line='print(len(x))')
TokenInfo(type=54 (OP), string='(', start=(2, 9), end=(2, 10), line='print(len(x))')
TokenInfo(type=1 (NAME), string='x', start=(2, 10), end=(2, 11), line='print(len(x))')
TokenInfo(type=54 (OP), string=')', start=(2, 11), end=(2, 12), line='print(len(x))')
TokenInfo(type=54 (OP), string=')', start=(2, 12), end=(2, 13), line='print(len(x))')
Traceback (most recent call last):
  File "/home/joncrall/code/xdoctest/dev/tokenize_mwe.py", line 10, in <module>
    for t in tokenize.generate_tokens(_readline):
  File "/home/joncrall/.pyenv/versions/3.11.2/lib/python3.11/tokenize.py", line 525, in _tokenize
    raise TokenError("EOF in multi-line statement", (lnum, 0))
tokenize.TokenError: ('EOF in multi-line statement', (3, 0))

However, on 3.12, this no longer raises an error:

Instead I get:

TokenInfo(type=2 (NUMBER), string='3', start=(1, 0), end=(1, 1), line='3, 4]')
TokenInfo(type=55 (OP), string=',', start=(1, 1), end=(1, 2), line='3, 4]')
TokenInfo(type=2 (NUMBER), string='4', start=(1, 3), end=(1, 4), line='3, 4]')
TokenInfo(type=55 (OP), string=']', start=(1, 4), end=(1, 5), line='3, 4]')
TokenInfo(type=4 (NEWLINE), string='', start=(1, 5), end=(1, 6), line='3, 4]')
TokenInfo(type=1 (NAME), string='print', start=(2, 0), end=(2, 5), line='print(len(x))')
TokenInfo(type=55 (OP), string='(', start=(2, 5), end=(2, 6), line='print(len(x))')
TokenInfo(type=1 (NAME), string='len', start=(2, 6), end=(2, 9), line='print(len(x))')
TokenInfo(type=55 (OP), string='(', start=(2, 9), end=(2, 10), line='print(len(x))')
TokenInfo(type=1 (NAME), string='x', start=(2, 10), end=(2, 11), line='print(len(x))')
TokenInfo(type=55 (OP), string=')', start=(2, 11), end=(2, 12), line='print(len(x))')
TokenInfo(type=55 (OP), string=')', start=(2, 12), end=(2, 13), line='print(len(x))')
TokenInfo(type=4 (NEWLINE), string='', start=(2, 13), end=(2, 14), line='print(len(x))')
TokenInfo(type=0 (ENDMARKER), string='', start=(3, 0), end=(3, 0), line='')

This is a problem for xdoctest because it uses tokenize to determine if a statement is "balanced" (i.e. if it is part of a line continuation or not). This is the magic I use to autodetect PS1 vs PS2 lines and prevent users from needing to manually specify if a line is a continuation or not.

Looking through the release and migration notes, I don't see anything that would indicate that this new behavior is introduced, so I suspect it is a bug. I'm sorry I didn't catch this before the 3.12 release. I've been busy.

If this is not a bug and an intended change, then it should be documented (please link to the relevant section if I missed it). If there is a way to work around this so xdoctest works on 3.12.0 that would be helpful. (It's probably time some of the parsing code got a rewrite anyway).

CPython versions tested on:

3.11, 3.12

Operating systems tested on:

Linux

The text was updated successfully, but these errors were encountered:

sobolevn · 2023-10-23T18:43:24Z

cc @pablogsal @lysnikolaou

terryjreedy · 2023-10-23T20:42:38Z

lines = list(lines) does not do anything when lines is a list, as here. Confusing in a claimed MWE.

For 3.12, the tokenize function was rewritten to use the real C-coded tokenize function instead of Python code intended to mimic the C function. "Looking through the release and migration notes,": did you look through the changelog? for 'token'? There were numerous bugfix issues that would not be mentioned in What's New.

Erotemic · 2023-10-23T21:18:40Z

Sorry about the extra line, I'll remove it.

Of the patches in the 3.12 changelog I've collected the ones that reference tokenize. I don't see any obvious notes indicating this should have changed, but if the implementation changed from Python to C, then it could be the case that the C variant never errored in the above case.

gh-105564: Don’t include artificil newlines in the line attribute of tokens in the APIs of the tokenize module. Patch by Pablo Galindo

gh-105549: Tokenize separately NUMBER and NAME tokens that are not ambiguous. Patch by Pablo Galindo.

gh-105390: Correctly raise tokenize.TokenError exceptions instead of SyntaxError for tokenize errors such as incomplete input. Patch by Pablo Galindo

gh-105259: Don’t include newline character for trailing NEWLINE tokens emitted in the tokenize module. Patch by Pablo Galindo

gh-105324: Fix the main function of the tokenize module when reading from sys.stdin. Patch by Pablo Galindo

gh-105017: Show CRLF lines in the tokenize string attribute in both NL and NEWLINE tokens. Patch by Marta Gómez.

gh-105017: Do not include an additional final NL token when parsing files having CRLF lines. Patch by Marta Gómez.

gh-104976: Ensure that trailing DEDENT tokenize.TokenInfo objects emitted by the tokenize module are reported as in Python 3.11. Patch by Pablo Galindo

gh-104972: Ensure that the line attribute in tokenize.TokenInfo objects in the tokenize module are always correct. Patch by Pablo Galindo

gh-104825: Tokens emitted by the tokenize module do not include an implicit \n character in the line attribute anymore. Patch by Pablo Galindo

gh-102856: Implement PEP 701 changes in the tokenize module. Patch by Marta Gómez Macías and Pablo Galindo Salgado

gh-102856: Implement the required C tokenizer changes for PEP 701. Patch by Pablo Galindo Salgado, Lysandros Nikolaou, Batuhan Taskaya, Marta Gómez Macías and sunmy2019.

gh-99891: Fix a bug in the tokenizer that could cause infinite recursion when showing syntax warnings that happen in the first line of the source. Patch by Pablo Galindo

gh-99581: Fixed a bug that was causing a buffer overflow if the tokenizer copies a line missing the newline caracter from a file that is as long as the available tokenizer buffer. Patch by Pablo galindo

gh-97997: Add running column offset to the tokenizer state to avoid calculating AST column information with pointer arithmetic.

gh-97973: Modify the tokenizer to return all necessary information the parser needs to set location information in the AST nodes, so that the parser does not have to calculate those doing pointer arithmetic.

gh-94360: Fixed a tokenizer crash when reading encoded files with syntax errors from stdin with non utf-8 encoded text. Patch by Pablo Galindo

EDIT: Thanks for the pointer about the change in implementation from Python -> C. I was able to patch xdoctest by including a vendored copy of tokenize.py from Python 3.11. This will ensure the library works with 3.12.0. However, it would be good to determine if this new behavior is indended or if the C code should also raise an error in the same instance.

pablogsal · 2023-10-24T02:09:43Z

Thanks for opening this issue, I will take a look but unfortunately there isn't much we can promise here. Notice this warning on the tokenize docs: https://docs.python.org/3/library/tokenize.html

Warning Note that the functions in this module are only designed to parse syntactically valid Python code (code that does not raise when parsed using ast.parse()). The behavior of the functions in this module is undefined when providing invalid Python code and it can change at any point.

As per the warning this means that the behavior of the functions is not defined when you pass code that's not syntactically valid, which includes this case.

Guaranteeing anything for invalid code is very hard for us because different users have different expectations. Indeed, some users have asked us to not raise in this case because they want to still receive the tokens even if they are unbalanced. Doing what you are suggesting will now break another set of users that don't expect us to raise.

pablogsal · 2023-10-24T02:10:41Z

This will ensure the library works with 3.12.0.

Notice this means that you won't be able to parse PEP 701 based code with that tokenize implementation

pablogsal · 2023-10-24T02:15:02Z

This is a problem for xdoctest because it uses tokenize to determine if a statement is "balanced" (i.e. if it is part of a line continuation or not). This is the magic I use to autodetect PS1 vs PS2 lines and prevent users from needing to manually specify if a line is a continuation or not.

This may not be the best tool for your use case, because is still somehow brittle. You can manually check if your statement is balanced by checking the tokens and check the paren balance yourself (adding +1 on opening of brackets and -1 on close). You can also consider the codeop module to check if the statement is balanced:

>>> codeop.compile_command("[3, 4")
>>> codeop.compile_command("[3, 4]")
<code object <module> at 0x104de6590, file "<input>", line 1>

this raises if is not balanced on the other end:

>>> codeop.compile_command("3, 4]")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/pgalindo3/.pyenv/versions/3.12.0/lib/python3.12/codeop.py", line 107, in compile_command
    return _maybe_compile(_compile, source, filename, symbol)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pgalindo3/.pyenv/versions/3.12.0/lib/python3.12/codeop.py", line 73, in _maybe_compile
    return compiler(source, filename, symbol)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/pgalindo3/.pyenv/versions/3.12.0/lib/python3.12/codeop.py", line 86, in _compile
    return compile(source, filename, symbol, PyCF_DONT_IMPLY_DEDENT | PyCF_ALLOW_INCOMPLETE_INPUT)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<input>", line 1
    3, 4]
        ^
SyntaxError: unmatched ']'

pablogsal · 2023-10-24T02:16:20Z

Yeah, confirmed that raising in this cade breaks some existing code that was adapted after 3.12 was released, so I propose to close this as won't fix. @lysnikolaou what do you think?

Erotemic · 2023-10-24T03:37:33Z

Notice this means that you won't be able to parse PEP 701 based code with that tokenize implementation

Right... that means I do need to find a new solution. That's my bad for relying on undefined behavior.

This needs to work not only for braces / brakets, but also for quotes. Also the current solution will mark 'try: raise Exception' as balanced (and the current code does seem to rely on this behavior). I wrote this so long ago, I'm forgetting all the cases, maybe the exception case is only failing a contrived test and it's not actually necessary.

What I ultimately need to do is: given a set of text lines that is probably valid Python code, label each line as starting its own statement (a PS1 line), or that it is continuing a previous statement (a PS2 line). Python has come a long way since I first implemented this in 3.7 and had to deal with issue16806, so maybe the current tooling (and better line number management) will help clean up the xdoctest parsing code.

pablogsal · 2023-10-24T04:38:29Z

Technically speaking there is a map that will be able to identify PS1 lines with only the line as input but there is no such thing for PS2 because that depends on whatever has been written as a PS1 line. You can label them as "potentially" PS2 lines, but that will be very brittle. You can identify PS2 lines if you have something that can look at previous lines.

lysnikolaou · 2023-10-24T09:58:38Z

Yeah, confirmed that raising in this cade breaks some existing code that was adapted after 3.12 was released, so I propose to close this as won't fix.

Agreed. Especially since we did the work to make sure that cases like do not raise due to PyCQA tools etc.

Thanks for the report @Erotemic and sorry we can't do more.

tfmorris · 2024-01-10T19:16:10Z

FWIW Python 3.12 broke web.py / OpenLibrary as well. webpy/webpy#784

Erotemic added the type-bug An unexpected behavior, bug, or error label Oct 23, 2023

AlexWaygood added the topic-parser label Oct 23, 2023

lysnikolaou closed this as not planned Won't fix, can't repro, duplicate, stale Oct 24, 2023

Erotemic mentioned this issue Oct 25, 2023

Dev/312 rework Erotemic/xdoctest#149

Closed

scottbarnes mentioned this issue Jan 6, 2024

Python 3.12: tokenize.TokenError: ('unterminated string literal webpy/webpy#784

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tokenize generate_tokens regression in CPython 3.12 #111224

Tokenize generate_tokens regression in CPython 3.12 #111224

Erotemic commented Oct 23, 2023 •

edited

Loading

sobolevn commented Oct 23, 2023

terryjreedy commented Oct 23, 2023

Erotemic commented Oct 23, 2023 •

edited

Loading

pablogsal commented Oct 24, 2023

pablogsal commented Oct 24, 2023

pablogsal commented Oct 24, 2023

pablogsal commented Oct 24, 2023

Erotemic commented Oct 24, 2023 •

edited

Loading

pablogsal commented Oct 24, 2023

lysnikolaou commented Oct 24, 2023

tfmorris commented Jan 10, 2024

Tokenize generate_tokens regression in CPython 3.12 #111224

Tokenize generate_tokens regression in CPython 3.12 #111224

Comments

Erotemic commented Oct 23, 2023 • edited Loading

Bug report

Bug description:

CPython versions tested on:

Operating systems tested on:

sobolevn commented Oct 23, 2023

terryjreedy commented Oct 23, 2023

Erotemic commented Oct 23, 2023 • edited Loading

pablogsal commented Oct 24, 2023

pablogsal commented Oct 24, 2023

pablogsal commented Oct 24, 2023

pablogsal commented Oct 24, 2023

Erotemic commented Oct 24, 2023 • edited Loading

pablogsal commented Oct 24, 2023

lysnikolaou commented Oct 24, 2023

tfmorris commented Jan 10, 2024

Erotemic commented Oct 23, 2023 •

edited

Loading

Erotemic commented Oct 23, 2023 •

edited

Loading

Erotemic commented Oct 24, 2023 •

edited

Loading