-
-
Notifications
You must be signed in to change notification settings - Fork 30.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Deprecate invalid escape sequences in str/bytes #71551
Comments
Attached patch deprecates invalid escape sequences in unicode strings. The point of this is to prevent issues such as bpo-27356 (and possibly other similar ones) in the future. Without the patch: >>> "hello \world"
'hello \\world' With the patch: >>> "hello \world"
DeprecationWarning: invalid escape sequence 'w' I'll need some help (patch isn't mergeable yet): test_doctest fails on my machine with the patch (and -W), and I don't know how to fix it. test_ast fails an assertion (!PyErr_Occurred() in PyObject_Call in abstract.c) when -W is on, and I also don't know how to fix it (I don't even know what causes it). Of course, I went ahead and fixed all instances of invalid escape sequences in the stdlib (that I could find) so that no DeprecationWarning is encountered. Lastly, I thought about also doing this to bytes, but I ran into some issues with some invalid escapes such as \u, and _codecs.escape_decode would trigger the warning when passed br"\8" (for example). Ultimately, I decided to leave bytes alone for now, since it's mostly on the lower-level side of things. If there's interest I can add it back. |
Have you searched the python-dev and python-ideas archives for the previous discussions of this issue? I don't remember for sure, but I think Guido might have made a ruling (not that the discussion couldn't be reopened if he has, but, well...) |
Now I have! I found nothing on Python-Dev, but apparently it's been discussed on Python-ideas before: https://mail.python.org/pipermail/python-ideas/2015-August/035031.html Guido hasn't participated in that discussion, and most of it was "This will break people's code", with people both for and against the idea, without an apparent consensus. Should I try a second round on Python-ideas, to try and get a consensus (or a BDFL ruling)? |
it is handy to be able to use |
Yes, it's in use in an awful lot of places (see my patch). The proper fix is to use raw strings, or, if you need actual escapes in the same string, manually escape them. However, as you'll see by looking at the patch, the vast majority of cases are fixed by prepending a single 'r' to the front of the string. In fact, only csv.py and html/parser.py needed more finer-grained escaping. I think that the argument "It works in non-raw strings" is weak. I've always used raw strings for regular expressions, and this patch would simply move this from being a style issue to being a syntax one (and I think it's fine :). |
There was a long discussion on Python-Dev. [1] Guido taken part in it. [1] http://comments.gmane.org/gmane.comp.python.devel/151612 |
Thanks, didn't find that one. Apparently Guido's stance is "Make this a silent warning, then we can discuss about preventing it later", which happens to be what I'm doing here. |
I found the cause of the failed assertion, an invalid escape sequence slipped through in a file. Patch attached (also with Serhiy's comments). It worries me a little though that pure Python code can cause a hard crash. Ok, it worries me a lot. Please don't merge this until it's fixed. I'm guessing this is a combination of unittest catching warnings and compiling the faulty source file. As to why a malformed node (i.e. one that raised a DeprecationWarning) managed to pass through unharmed is beyond me. |
I am okay with making it a silent warning. Can we do it in two stages though? It doesn't have to be two releases, I just mean two separate commits: (1) fix all places in the stdlib that violate this principle; (2) separately commit the code that causes the silent deprecation (and tests for it). What exactly was the hard crash you got? Do you think it was a bug in your own C code or in existing C code? |
I originally considered making two different patches, so there you go. deprecate_invalid_escapes_only_1.patch has the deprecation plus a test, and invalid_stdlib_escapes_1.patch fixes all invalid escapes in the stdlib. My code was the cause, although no directly; it was 'assert(!PyErr_Occurred())' at the beginning of PyObject_Call in Objects/abstract.c which failed. This happened when I ran the whole test suite (although just running test_ast was fine to reproduce it) with the '-W error' command line switch. One stdlib module (I don't remember which one) had one single invalid escape sequence in it, and then test_ast.ASTValidatorTests.test_stdlib_validates triggered the failed assertion. Fixing the invalid escape removes the failure and all tests pass. One can reliably reproduce the crash with the patch by adding a string with an invalid escape in any of the stdlib files (and running with '-W error'): No invalid sequence: >>> import unittest, test.test_ast
>>> unittest.main(test.test_ast)
.............................................................................. Ran 78 tests in 5.538s OK With an invalid sequence in a file: >>> import unittest, test.test_ast
>>> unittest.main(test.test_ast)
............................................Fatal Python error: a function returned a result with an error set
DeprecationWarning: invalid escape sequence 'w'
During handling of the above exception, another exception occurred: SystemError: <built-in function compile> returned a result with an error set Current thread 0x00001ba0 (most recent call first): Then I get the usual "Python has stopped working" Windows prompt (strangely enough, before I'd get a prompt saying "Assertion failed" with the line, but not this time). I'm not sure where the error lies exactly. Should I open another issue for that? |
Hm, if you manage to trigger an assert() in the C code by writing some evil Maybe someone who has worked more with the C code recently could help you |
Ah right, assert() is only enabled in debug mode, I forgot that. My (very uneducated) guess is that compile() got the error (which was a warning) but then decided to return a value anyway, and the next thing that tries to call anything crashes Python. I opened bpo-27394 to get some experts' advice. |
Aaand I feel pretty stupid; I didn't check the return value of PyErr_WarnFormat, so it was my mistake. Attached new patch, actually done right this time. |
Hello Emanual, I think I have fixed your problem with -Werror, by handling the exception returned by PyErr_WarnFormat() (see my patch). Thanks for separating the actual change from the escape violation fixes; it made it easier to spot the real problem :) Also, I like the general idea of the change. It would be good to update the documentation as well (e.g. What’s New, and <https://docs.python.org/3.6/reference/lexical_analysis.html#string-and-bytes-literals\>). It would be good to do the same for byte string literals, at least to keep things consistent. What did you try so far? Do you have a partial patch for it? |
Hah, we posted the same fix almost at the same time :) |
Indeed, we did, thanks for letting me know my mistake :) I didn't get very far into making bytes literal disallow invalid sequences, as I ran into issues with _codecs.escape_decode throwing the warning even when the literal was fine, and I think I stopped there and figured I'd at least post that patch and see if people are interested in extending that modification to bytes (turns out so). I forgot about docs, will do so soon, but I'll try to extend the patch for bytes first. I'll see if I can make literals warn but not e.g. _codecs.escape_decode (or anything else, really). Thanks! |
Code samples in the documentation should also be fixed, like at <https://docs.python.org/3.6/library/re.html#re.split\>. I think you can run “make -C Doc doctest” or something similar, which may help find some of these. Also, playing with your current patch, it seems to affect the “unicode-escape” codec. Not sure if that is a problem, but it probably deserves also documenting the change. |
Guido: "I am okay with making it a silent warning." The current patch raises a DeprecationWarning which is silent by default, but seen using python3 -Wd. What is the "long term" plan: always raise an *exception* in Python 3.7? Which exception? Another option is to always emit a SyntaxWarning, but don't raise an exception in long term. It is possible to get an exception using python3 -Werror. There is also FutureWarning: "Base class for warnings about constructs that will change semantically in the future" or RuntimeWarning "Base class for warnings about dubious runtime behavior". |
DeprecationWarning is used when we want to remove a feature. It becomes an error in the future. FutureWarning is used when we want change the meaning of a feature instead of removing it. For example re.split(':*', 'a:bc') emits a FutureWarning and returns ['a', 'bc'] because there is a plan to make it returning ['', 'a', 'b', 'c', '']. I think "a silent warning" means that it should emit a DeprecationWarning or a PendingDeprecationWarning. Since there is no haste, we should use 2-releases deprecation period. After this a deprecation can be changed to a SynataxWarning in 3.8 and to a UnicodeDecodeError (for strings) and a ValueError (for bytes) in 4.0. The latter are converted to SyntaxError by parser. At the end we should get the same behavior as for truncated \x and \u escapes. >>> '\u'
File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape
>>> b'\x'
File "<stdin>", line 1
SyntaxError: (value error) invalid \x escape at position 0 Maybe change a parser to convert warnings to a SyntaxWarning? |
I think ultimately a SyntaxError should be fine. I don't know *when* it becomes appropriate to change a warning into an error; I was thinking 3.7 but, as Serhiy said, there's no rush. I think waiting five release cycles is overkill though, that means the error won't be until 8 years from now (assuming release cycle periods don't change)! I think at most 3.8 should be fine for making this a full-on syntax error. |
@ebarry: To move faster, you should also worker with linters (pylint, pychecker, pyflakes, pycodestyle, flake8, ...) to log a warning to help projects to be prepared this change. linters are used on Python 2-only projects, so it will help them to be prepared to the final Python 3.<n> which will raise an exception. |
Yes, this change is likely to break a lot of code, so an extended deprecation period (certainly longer than 3.7, which Guido has already mandated) is the minimum). Guido hasn't agreed to making it an error yet, as far as I can see ;) |
I think ultimately it has to become an error (otherwise I wouldn't Contacting the PyCQA folks would also be a great idea -- can anyone |
Easing transition is always a good idea. I'll contact the PyCQA people later today when I'm back home. On afterthought, it makes sense to wait more than two release cycles before making this an error. I don't really have a strong opinion when exactly that should happen. |
Just brought this to the attention of the code-quality mailing list, so linter maintainers should (hopefully!) catch up soon. Also new patch, I forgot to add '\c' in the tests. |
Forgot to say I reviewed invalid_stdlib_escapes_1.patch the other day and can’t see any problems. |
+1 on getting this in. Who can help reviewing and merging before beta 1? |
Thank you R. David for the review, here's a new patch with the one change. |
I suggest to not change fixcid.py. It is not correct and there is special issue for this (bpo-27952). |
All right, since you'll work on it I'm leaving it out. Removed it and test_bytes (which you already fixed, thanks!) from new patch. |
New changeset b4cc62473c13 by R David Murray in branch 'default': |
Here's a copy of Emanuel's deprecation patch with a versionchanged note in the lexical docs and a whatsnew entry. |
New changeset 38802c38cfe1 by R David Murray in branch 'default': |
Thank you David for taking the time to review and commit this :) |
Thanks Emanuel. No bets on how much hate mail we get for this :) |
Thank you all for persisting on this. I have seen numerous beginners be puzzled why normal (cooked) strings using '\' for Windows paths sometimes work and sometimes 'mysteriously' do not, as in the initially referenced issue. I also think it better to consistently use 'r' for REs with '\' intended to be passed through to re. (And I pushed some of the IDLE code that was patched.) |
New changeset 60085c8f01fe by R David Murray in branch 'default': |
New changeset 98a57845c8cc by Martin Panter in branch 'default': |
Currently the deprecation message is not so useful when fixing lots of files in a large project. For example, I have two files foo.py and bar.py: # foo.py
import bar
# bar.py
print('\d') It gives:
$ python3.6 -W error foo.py
Traceback (most recent call last):
File "foo.py", line 1, in <module>
import bar
DeprecationWarning: invalid escape sequence '\d' Things are worse when __import__, imp or importlib are involved. I have to add some codes to show which module is imported. It would be better to have at least filenames and line numbers:
$ ./python -W error foo.py
Traceback (most recent call last):
File "foo.py", line 1, in <module>
import bar
File "/home/yen/Projects/cpython/build/bar.py", line 1
print('\d')
^
SyntaxError: (deprecated usage) invalid escape sequence '\d' I have a naive try that prints more information. Raising SyntaxError may not be a good idea, anyway. |
Fair enough, but please open a new issue for that. @terry - you're welcome; that's exactly the reason I pushed for it :) |
Opened a new issue at bpo-28128. |
One consequence of this change is that now any string that has a backslash needs to be escaped or raw, leading to changes like this on (cherrypy/cherrypy@1d8c03e#diff-be33a4f55d59dfc70fc6452482f3a7a4) where the diagram in the docstring is the culprit. An escaped backslash is not viable in this case, so a raw string is required. This particular example strikes me as counter-intuitive, though maybe I just need to adjust my intuition. Was the intention for a docstring like above to use raw strings? |
Yes. |
Yes, this was the intention. One of often errors is using "\n" in non-raw docstrings. This change doesn't prevent this error, but increases chances of catching it when there are other backslashes in the docstring. |
Also note that we have fixed a number of bugs in the stdlib code where a raw string was not used for a docstring when it should have been. And when I say bugs, I mean both formatting problems in pydoc, and doctest bugs. There may even have been a case where it produced a code bug, but I'm not sure I'm recalling that correctly :) So yes, requiring that a docstring containing backslashes be marked as a raw string is very intentional. |
I have created bpo-32912 as a follow-up to this issue for 3.8. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: