Change in tokenize.generate_tokens behaviour with non-ASCII #112943

Closed
hugovk opened this issue Dec 10, 2023 · 3 comments
Labels
3.12 (bugs and security fixes); 3.13 (bugs and security fixes); interpreter-core (Objects, Python, Grammar, and Parser dirs); topic-parser; type-bug (An unexpected behavior, bug, or error)

Comments

@hugovk
Member

hugovk commented Dec 10, 2023

Bug report

Bug description:

This docstring has non-ASCII characters:

import io
import tokenize

src = '''\
def thing():
    """Autorzy, którzy tą jednostkę mają wpisani jako AKTUALNA -- czyli
    aktualni pracownicy, obecni pracownicy"""
    ...
'''
tokens = list(tokenize.generate_tokens(io.StringIO(src).readline))

for token in tokens:
    print(token)

assert tokens[7].end == (3, 45), tokens[7].end

tokenize.generate_tokens behaves differently on 3.11 than on 3.12 (and 3.13): the STRING token for the docstring ends at (3, 45) on 3.11 but at (3, 41) on 3.12.

Python 3.11

python3.11 --version
Python 3.11.7
python3.11 1.py
TokenInfo(type=1 (NAME), string='def', start=(1, 0), end=(1, 3), line='def thing():\n')
TokenInfo(type=1 (NAME), string='thing', start=(1, 4), end=(1, 9), line='def thing():\n')
TokenInfo(type=54 (OP), string='(', start=(1, 9), end=(1, 10), line='def thing():\n')
TokenInfo(type=54 (OP), string=')', start=(1, 10), end=(1, 11), line='def thing():\n')
TokenInfo(type=54 (OP), string=':', start=(1, 11), end=(1, 12), line='def thing():\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 12), end=(1, 13), line='def thing():\n')
TokenInfo(type=5 (INDENT), string='    ', start=(2, 0), end=(2, 4), line='    """Autorzy, którzy tą jednostkę mają wpisani jako AKTUALNA -- czyli\n')
TokenInfo(type=3 (STRING), string='"""Autorzy, którzy tą jednostkę mają wpisani jako AKTUALNA -- czyli\n    aktualni pracownicy, obecni pracownicy"""', start=(2, 4), end=(3, 45), line='    """Autorzy, którzy tą jednostkę mają wpisani jako AKTUALNA -- czyli\n    aktualni pracownicy, obecni pracownicy"""\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(3, 45), end=(3, 46), line='    aktualni pracownicy, obecni pracownicy"""\n')
TokenInfo(type=54 (OP), string='...', start=(4, 4), end=(4, 7), line='    ...\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(4, 7), end=(4, 8), line='    ...\n')
TokenInfo(type=6 (DEDENT), string='', start=(5, 0), end=(5, 0), line='')
TokenInfo(type=0 (ENDMARKER), string='', start=(5, 0), end=(5, 0), line='')

Python 3.12

python3.12 --version
Python 3.12.1
python3.12 1.py
TokenInfo(type=1 (NAME), string='def', start=(1, 0), end=(1, 3), line='def thing():\n')
TokenInfo(type=1 (NAME), string='thing', start=(1, 4), end=(1, 9), line='def thing():\n')
TokenInfo(type=55 (OP), string='(', start=(1, 9), end=(1, 10), line='def thing():\n')
TokenInfo(type=55 (OP), string=')', start=(1, 10), end=(1, 11), line='def thing():\n')
TokenInfo(type=55 (OP), string=':', start=(1, 11), end=(1, 12), line='def thing():\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 12), end=(1, 13), line='def thing():\n')
TokenInfo(type=5 (INDENT), string='    ', start=(2, 0), end=(2, 4), line='    """Autorzy, którzy tą jednostkę mają wpisani jako AKTUALNA -- czyli\n')
TokenInfo(type=3 (STRING), string='"""Autorzy, którzy tą jednostkę mają wpisani jako AKTUALNA -- czyli\n    aktualni pracownicy, obecni pracownicy"""', start=(2, 4), end=(3, 41), line='    """Autorzy, którzy tą jednostkę mają wpisani jako AKTUALNA -- czyli\n    aktualni pracownicy, obecni pracownicy"""\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(3, 45), end=(3, 46), line='    aktualni pracownicy, obecni pracownicy"""\n')
TokenInfo(type=55 (OP), string='...', start=(4, 4), end=(4, 7), line='    ...\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(4, 7), end=(4, 8), line='    ...\n')
TokenInfo(type=6 (DEDENT), string='', start=(5, 0), end=(5, 0), line='')
TokenInfo(type=0 (ENDMARKER), string='', start=(5, 0), end=(5, 0), line='')
Traceback (most recent call last):
  File "/private/tmp/1.py", line 15, in <module>
    assert tokens[7].end == (3, 45), tokens[7].end
AssertionError: (3, 41)
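
For reference, the size of the discrepancy can be reproduced with plain string arithmetic. This is only an observation about the numbers in the output above, not a diagnosis confirmed in this thread: the first docstring line contains four non-ASCII characters (ó, ą, ę, ą), each of which takes two bytes in UTF-8, and 45 - 4 = 41. The variable names are illustrative.

```python
# The two physical lines covered by the docstring token, copied from the reproducer.
first_line = '    """Autorzy, którzy tą jednostkę mają wpisani jako AKTUALNA -- czyli\n'
last_line = '    aktualni pracownicy, obecni pracownicy"""\n'

# How many more bytes than characters the first line needs in UTF-8.
extra_bytes = len(first_line.encode("utf-8")) - len(first_line)

# Character-based end column of the token on its last line (what 3.11 reports).
expected_end_col = len(last_line.rstrip("\n"))

print(extra_bytes)                     # 4
print(expected_end_col)                # 45
print(expected_end_col - extra_bytes)  # 41, the column 3.12.1 reports
```

If this reading is right, the 3.12 end column would be off by the number of extra UTF-8 bytes on an earlier line of a multi-line string, which would suggest byte offsets and character offsets being mixed somewhere.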

git bisect

Points to PR #104323 (gh-102856: Python tokenizer implementation for PEP 701).

What’s New In Python 3.12 » Changes in the Python API says:

Additionally, there may be some minor behavioral changes as a consequence of the changes required to support PEP 701. Some of these changes include:

This change isn't listed here, but is this an acceptable behavioural change or something to fix?

cc @lysnikolaou @pablogsal

More info

Originally reported at asottile/pyupgrade#923 by @mpasternak with the minimal reproducer created by @asottile.

CPython versions tested on:

3.12, 3.13

Operating systems tested on:

macOS

Linked PRs

@hugovk hugovk added type-bug An unexpected behavior, bug, or error 3.12 bugs and security fixes 3.13 bugs and security fixes labels Dec 10, 2023
@hugovk hugovk added interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-parser labels Dec 10, 2023
pablogsal added a commit to pablogsal/cpython that referenced this issue Dec 11, 2023
pablogsal added a commit to pablogsal/cpython that referenced this issue Dec 11, 2023
…okens in the tokenize module (pythonGH-112949)

(cherry picked from commit a135a6d)

Co-authored-by: Pablo Galindo Salgado
@pablogsal
Member

Thanks for the report @hugovk

pablogsal added a commit that referenced this issue Dec 11, 2023
…in the tokenize module (GH-112949) (#112957)

(cherry picked from commit a135a6d)
@hugovk
Member Author

hugovk commented Dec 11, 2023

Thanks for the quick fix!

@mpasternak This will be in Python 3.12.2, due in February 2024, and 3.13.0a3, due next week (Tuesday, 2023-12-19).
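
Until a fixed release is installed, a tool that depends on token positions could check whether the running interpreter is affected. The following is a minimal sketch, not part of the fix: `expected_string_end` and `mismatched_string_ends` are hypothetical helpers that derive the end position of each STRING token from its own text (in characters) and report any token whose `end` disagrees. It assumes `\n` line endings.

```python
import io
import tokenize


def expected_string_end(tok):
    """End position implied by a STRING token's own text, counted in characters."""
    row, col = tok.start
    *head, tail = tok.string.split("\n")
    if head:
        # Multi-line string: the end column is the length of its last physical line.
        return (row + len(head), len(tail))
    # Single-line string: the end column is the start column plus the token length.
    return (row, col + len(tail))


def mismatched_string_ends(src):
    """Return (reported_end, expected_end) for STRING tokens whose end looks wrong."""
    return [
        (tok.end, expected_string_end(tok))
        for tok in tokenize.generate_tokens(io.StringIO(src).readline)
        if tok.type == tokenize.STRING and tok.end != expected_string_end(tok)
    ]


src = '''\
def thing():
    """Autorzy, którzy tą jednostkę mają wpisani jako AKTUALNA -- czyli
    aktualni pracownicy, obecni pracownicy"""
    ...
'''

# Going by the outputs in this report: an empty list on 3.11.7, one mismatch on 3.12.1.
print(mismatched_string_ends(src))
```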

@mpasternak

Probably the first case a comment in my software ever helped anybody. Thanks!
