Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reject invalid escape sequences (and octal escape sequences) in bytes and Unicode strings #98401

Closed
vstinner opened this issue Oct 18, 2022 · 19 comments
Labels
type-bug An unexpected behavior, bug, or error

Comments

@vstinner
Copy link
Member

vstinner commented Oct 18, 2022

In Python 3.6, invalid escape sequence were deprecated in string literals (bytes and str): issue #71551, commit 110b6fe.

What's New in Python 3.6: Deprecated Python behavior:

A backslash-character pair that is not a valid escape sequence now generates a DeprecationWarning. Although this will eventually become a SyntaxError, that will not be for several Python releases. (Contributed by Emanuel Barry in bpo-27364.)

I propose now raises a SyntaxError, rather than a DeprecationWarning (which is silent in most cases).

Example:

$ python3 -W default -c 'print(list("\z"))'
<string>:1: DeprecationWarning: invalid escape sequence '\z'
['\\', 'z']
$ python3 -W default -c 'print(list(b"\z"))'
<string>:1: DeprecationWarning: invalid escape sequence '\z'
[92, 122]

Note: Python REPL ate some DeprecationWarning which makes manual testing harder. It was fixed last month by commit 426d72e in issue gh-96052.


Python 3.11 now emits a deprecation warning for invalid octal escape sequence (issue gh-81548):

Octal escapes in string and bytes literals with value larger than 0o377 now produce DeprecationWarning. In a future Python version they will be a SyntaxWarning and eventually a SyntaxError. (Contributed by Serhiy Storchaka in gh-81548.)

Example:

$ python3.11 -Wdefault -c 'print(list(b"\777"))'
<string>:1: DeprecationWarning: invalid octal escape sequence '\777'
[255]
@vstinner vstinner added the type-bug An unexpected behavior, bug, or error label Oct 18, 2022
@vstinner
Copy link
Member Author

I created PR #98404 to implement this change.

This change should help to catch some mistakes in regular expressions and Windows paths.

Example with PR #98404 which now raises SyntaxError:

>>> import re
>>> re.findall('.py\B', '1python 2pyc 3pyo 4py')
SyntaxError: invalid escape sequence '\B'
>>> wrong_path = "C:\Program Files\Python\python.exe"
SyntaxError: invalid escape sequence '\P'

Raw strings r'...' should be used instead:

>>> import re
>>> re.findall(r'.py\B', '1python 2pyc 3pyo 4py')
['1py', '2py', '3py']
>>> wrong_path = r"C:\Program Files\Python\python.exe"
>>> wrong_path
'C:\\Program Files\\Python\\python.exe'

@vstinner
Copy link
Member Author

cc @serhiy-storchaka @hugovk

@mdickinson
Copy link
Member

See also discussion in #77093, and the reasons that the SyntaxWarning change was rolled back.

@serhiy-storchaka
Copy link
Member

When this warning was introduced, there were no any plans of making it an error in the near future. It was planned as a warning with very long period.

The specific of this warning is that it is not emitted for the code containing this specific kind of bug, but only emitted for the code which does not contain a bug. But which may be in near proximity of the code which contains a bug (and does not emit a warning itself).

Perhaps it is a time to make it more visible (convert it into SyntaxWarning). And then after other long period of 4-5 versions it could be converted into SyntaxError.

@hugovk
Copy link
Member

hugovk commented Oct 18, 2022

It first went into Python 3.6, now EOL, so all supported Python versions have the deprecation warning.


Serhiy suggested in #71551 (comment) a DeprecationWarning or PendingDeprecationWarning in 3.6, a SyntaxWarning in 3.8 and a SyntaxError in 4.0:

I think "a silent warning" means that it should emit a DeprecationWarning or a PendingDeprecationWarning. Since there is no haste, we should use 2-releases deprecation period. After this a deprecation can be changed to a SynataxWarning in 3.8 and to a UnicodeDecodeError (for strings) and a ValueError (for bytes) in 4.0. The latter are converted to SyntaxError by parser.

So that was a suggestion of 2 releases from deprecation to syntax warning. If we do a syntax warning now, that will have already been 6 releases.

I don't know what the original estimate of 3.8 -> 4.0 was? Also two releases?


Guido suggested in #71551 (comment) for "several" releases before the error.

I think ultimately it has to become an error (otherwise I wouldn't
have agreed to the warning, silent or not). But because there's so
much 3rd party code that depends on it we indeed need to take
"several" releases before we go there.

Six releases (3.6 -> 3.12) is perhaps several?


The original issue also had suggestions from Victor and Guido to contact linters/PyCQA to include a warning to help projects prepare, and the original author did so.

And the good news is it was added to pycodestyle (part of Flake8) as W605 in April 2018, so that's 4.5 years of linter warnings. (Thanks to this, I've fixed invalid escape sequences in several packages.)

@vstinner Would it be worth running pycodestyle --select W605 on some top list of packages to get an idea of exposure?

Depending on the result, it may be worth promoting to a SyntaxWarning for an extra release or two, or even keeping the DeprecationWarning a bit longer (in light of #77093 (comment)).

@vstinner
Copy link
Member Author

vstinner commented Nov 2, 2022

Ok, let's start with replacing DeprecationWarning (silent by default) with SyntaxWarning (displayed once by default): PR #99011.

vstinner added a commit that referenced this issue Nov 3, 2022
A backslash-character pair that is not a valid escape sequence now
generates a SyntaxWarning, instead of DeprecationWarning.  For
example, re.compile("\d+\.\d+") now emits a SyntaxWarning ("\d" is an
invalid escape sequence), use raw strings for regular expression:
re.compile(r"\d+\.\d+"). In a future Python version, SyntaxError will
eventually be raised, instead of SyntaxWarning.

Octal escapes with value larger than 0o377 (ex: "\477"), deprecated
in Python 3.11, now produce a SyntaxWarning, instead of
DeprecationWarning. In a future Python version they will be
eventually a SyntaxError.

codecs.escape_decode() and codecs.unicode_escape_decode() are left
unchanged: they still emit DeprecationWarning.

* The parser only emits SyntaxWarning for Python 3.12 (feature
  version), and still emits DeprecationWarning on older Python
  versions.
* Fix SyntaxWarning by using raw strings in Tools/c-analyzer/ and
  wasm_build.py.
@vstinner
Copy link
Member Author

vstinner commented Nov 3, 2022

Fixed by a60ddd3

At the end, it remains a warning, but SyntaxWarning (showed by default) is now emitted instead of DeprecationWarning (silent by default). According to @hugovk, sadly many projects of the PyPI top 5000 contain invalid escape sequences. It will take time to update them, before considering to convert the SyntaxWarning to an SyntaxError.

Thanks for everybody who helped me on making this change possible!

For affected projects: just add r at the beginning of your strings to disable escape sequences. But be careful, "newline:\n, invalid: \P" cannot be simply converted to a raw string by adding r since it would convert the newline character (U+000A) to two characters (backslash followed by newline). The correct fix is to double the second backslash: "newline:\n, invalid: \\P" (\P becomes \\P). An alternative is to mix different kinds of strings: "newline:\n" r", invalid:\P" or "newline:\n" + r", invalid:\P" (depending on your personal preference ;-)).

@vstinner vstinner closed this as completed Nov 3, 2022
@serhiy-storchaka
Copy link
Member

In case of regular expressions, '\n' and r'\n' mean the same, so in almost all cases you can freely add the r prefix. The only exception is \b not in a character set. But it is extremely rare, and if you have one in non-raw string representing a regular expression, it is most likely a bug.

I'm wondering if we'll be forced to roll back these changes before release again due to a lot of warnings in third-party code.

@vstinner
Copy link
Member Author

vstinner commented Nov 3, 2022

I'm wondering if we'll be forced to roll back these changes before release again due to a lot of warnings in third-party code.

I created https://discuss.python.org/t/collaboration-on-handling-python-3-12-incompatible-changes-distutils-removal-invalid-escape-escape-etc/20721 to discuss this change.

@fenchu
Copy link

fenchu commented Nov 23, 2023

If you work on windows and have the habit of documenting your code inside trippelquotes, (A company requirement) you now get a warning whenever a directory name starts on a d because ex: """C:\dist\project\file.dat""" gives a SyntaxWarning.

And not to mention the regexp issue that have a fix.

More work to port a 3.11 project to 3.12 than moving it from python2 to python3

@vstinner
Copy link
Member Author

Just replace """C:\dist\project\file.dat""" with r"""C:\dist\project\file.dat""": add r prefix.

@oliviermattelaer
Copy link

oliviermattelaer commented Dec 16, 2023

Is there a tool that modifies the problematic regular expression?
I have so many syntax warning inside my code that doing it manually sounds too risky

ryantking pushed a commit to fsc-platform/fork-ansible-role-proxmox that referenced this issue Aug 5, 2024
In Python, only a [limited set of characters][0] can be escaped with a
backslash. In recent versions of Python, attempting to escape a
non-escapeable character [raises a SyntaxWarning][1], polluting the
playbook output with warnings like:

    <unknown>:59: SyntaxWarning: invalid escape sequence '\('
    <unknown>:60: SyntaxWarning: invalid escape sequence '\.'
    <unknown>:61: SyntaxWarning: invalid escape sequence '\.'

This commit adds the string literal prefix 'r' to regular expressions in
pvesh.py to ensure that escape sequences are not interpreted in the
given strings. As there were no valid escape sequences in those strings
to begin with, the actual string content remains the same.

[0]: https://docs.python.org/3/reference/lexical_analysis.html#escape-sequences
[1]: python/cpython#98401
jrguzman-ms pushed a commit to msft-mirror-aosp/platform.external.libchrome that referenced this issue Sep 12, 2024
Was seeing a lot of SyntaxWarnings when building as some regex
strings were not raw strings and thus were showing invalid escape
sequences due to (python/cpython#98401).
These are just warnings now on Python 3.12, but will become errors
in the future. Fixing now to prevent builds breaking. It seems the
files are mostly following this convention now, so this makes the
non raw strings consistent.

Change-Id: Ia8eacfd353c41a6c1d8e0482f3b7fbd4dc7a93e1
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/5832811
Reviewed-by: Mihai Sardarescu <[email protected]>
Commit-Queue: Andrew Grieve <[email protected]>
Reviewed-by: Andrew Grieve <[email protected]>
Cr-Commit-Position: refs/heads/main@{#1352933}


CrOS-Libchrome-Original-Commit: 31f4229550ffa8f15f7a5da9b67a84cc4f4eb757
devsnek pushed a commit to denoland/chromium_build that referenced this issue Sep 17, 2024
Was seeing a lot of SyntaxWarnings when building as some regex
strings were not raw strings and thus were showing invalid escape
sequences due to (python/cpython#98401).
These are just warnings now on Python 3.12, but will become errors
in the future. Fixing now to prevent builds breaking. It seems the
files are mostly following this convention now, so this makes the
non raw strings consistent.

Change-Id: Ia8eacfd353c41a6c1d8e0482f3b7fbd4dc7a93e1
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/5832811
Reviewed-by: Mihai Sardarescu <[email protected]>
Commit-Queue: Andrew Grieve <[email protected]>
Reviewed-by: Andrew Grieve <[email protected]>
Cr-Commit-Position: refs/heads/main@{#1352933}
NOKEYCHECK=True
GitOrigin-RevId: 31f4229550ffa8f15f7a5da9b67a84cc4f4eb757
ntpsec-bot pushed a commit to ntpsec/ntpsec that referenced this issue Oct 2, 2024
Per https://docs.python.org/3/whatsnew/3.12.html:

A backslash-character pair that is not a valid escape sequence now generates a
SyntaxWarning, instead of DeprecationWarning.

See python/cpython#98401

The warnings without this change are:

attic/calc_tickadj/calc_tickadj:74: SyntaxWarning: invalid escape sequence '\d'
contrib/cpu-temp-log:60: SyntaxWarning: invalid escape sequence '\s'
@umarbutler
Copy link

I stumbled on this thread after getting tired of having the same warning pop up over and over again every time I paste a Windows-formatted path into a notebook or script.

Looking at the 'What's New in Python 3.12' article on the Python website, it seems that '[i]n a future Python version, SyntaxError will eventually be raised, instead of SyntaxWarning'.

If that is the case, then I would strongly advise raising a warning that is more descriptive than just SyntaxWarning. I had just been ignoring the warnings as a result, but had I seen a warning that used the word deprecation, I would've immediately known that I need to update all of my code now.

Additionally, is it really worth breaking so much existing code, and making it more difficult to copy and paste Windows paths, or even write a simple string like "You need to sleep and/or rest!", to achieve whatever objective this change is meant to achieve?

@alphaparrot
Copy link

I'm with @umarbutler and @fenchu on this one; this seems like an unnecessary change that creates a large amount of work across a very wide range of projects. Windows paths have already been mentioned, but a very big one that will come up very quickly and annoyingly is that much of the scientific (and mathematical) community has adopted Python for analysis tasks, and there are many libraries and packages written in Python for performing scientific tasks. These projects often include LaTeX-formatted sequences in their docstrings, as some documentation engines can automatically parse LaTeX math sequences into mathematical symbols. LaTeX macros for greek letters and other symbols typically start with a backslash; the inclusion of thoroughly-documented equations in docstrings in scientific packages will therefore break those packages once this becomes an Error rather than a Warning (and the move to SyntaxWarning means those packages will now start spitting loud warnings).

The nature of scientific code development is that a great deal of it is built by academics in their spare time, or by graduate students and postdocs who may not stay in the field. As a consequence, a lot of code that works fine is nonetheless not actively maintained, or only infrequently maintained. So it's likely that an awful lot of scientific developers will not be sufficiently involved in Python development to see this coming, or may not have the time to update their code. Similarly, while using raw-strings for docstrings containing backslashes has been in PEP-257 since 2001, most programmers in scientific disciplines have no idea that's the case, because docstrings have always just worked as plaintext. Why unnecessarily add an extra character that seemingly does nothing?

PEP-257 also states that the PEP contains conventions, not laws or syntax. This seems to be a move towards encoding PEP conventions as syntax rules. I would suggest that a better long-run approach is to treat docstrings as separate from all other strings (they already are) and always process them as a raw string literals. They behave like plaintext documentation, are used as plaintext documentation, and should therefore be processed that way. The only thing I could see that breaking is if someone put a byte literal in their docstring for some ungodly reason--as opposed to the current plan of action which will break a great many packages, and demand valuable labor from even more packages maintainers.

@vstinner
Copy link
Member Author

vstinner commented Dec 3, 2024

Since this issue is closed, I suggest you to open a topic on https://discuss.python.org/.

@umarbutler
Copy link

@alphaparrot I've created a thread on the Python forum. I hope you don't mind that I quoted your comment as well. You can find it here https://discuss.python.org/t/please-dont-break-invalid-escape-sequences/74134

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

No branches or pull requests

10 participants