Deprecate invalid escape sequences in str/bytes #71551

Closed
Vgr255 mannequin opened this issue Jun 21, 2016 · 56 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) stdlib Python modules in the Lib dir topic-unicode type-bug An unexpected behavior, bug, or error

Comments

@Vgr255
Mannequin

Vgr255 mannequin commented Jun 21, 2016

BPO 27364
Nosy @gvanrossum, @terryjreedy, @jaraco, @vstinner, @ezio-melotti, @bitdancer, @vadmium, @serhiy-storchaka, @ztane, @jayvdb, @yan12125, @Vgr255
Files
  • deprecate_invalid_unicode_escapes.patch
  • deprecate_invalid_unicode_escapes_2.patch
  • deprecate_invalid_escapes_only_1.patch
  • invalid_stdlib_escapes_1.patch
  • deprecate_invalid_escapes_only_2.patch
  • deprecate_invalid_escapes_only_2.patch: Error handling
  • deprecate_invalid_escapes_only_3.patch
  • deprecate_invalid_escapes_both_1.patch
  • invalid_stdlib_escapes_2.patch
  • deprecate_invalid_escapes_both_2.patch
  • deprecate_invalid_escapes_both_3.patch
  • invalid_stdlib_escapes_3.patch
  • deprecate_invalid_escapes_both_4.patch
  • invalid_stdlib_escapes_3_regenerated.patch
  • invalid_stdlib_escapes_3_rebased_2.patch
  • invalid_stdlib_escapes_4.patch
  • invalid_stdlib_escapes_5.patch
  • deprecate_invalid_escapes_both_5.patch
  • verbose-deprecation.diff
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2016-09-08.19:35:58.631>
    created_at = <Date 2016-06-21.20:34:19.971>
    labels = ['interpreter-core', 'type-bug', 'library', 'expert-unicode']
    title = 'Deprecate invalid escape sequences in str/bytes'
    updated_at = <Date 2018-02-22.18:36:32.149>
    user = 'https://github.com/Vgr255'

    bugs.python.org fields:

    activity = <Date 2018-02-22.18:36:32.149>
    actor = 'abarry'
    assignee = 'none'
    closed = True
    closed_date = <Date 2016-09-08.19:35:58.631>
    closer = 'abarry'
    components = ['Interpreter Core', 'Library (Lib)', 'Unicode']
    creation = <Date 2016-06-21.20:34:19.971>
    creator = 'abarry'
    dependencies = []
    files = ['43499', '43527', '43549', '43550', '43552', '43553', '43569', '43777', '44107', '44108', '44322', '44382', '44432', '44435', '44457', '44463', '44465', '44475', '44550']
    hgrepos = []
    issue_num = 27364
    keywords = ['patch']
    message_count = 56.0
    messages = ['269022', '269114', '269119', '269122', '269152', '269155', '269156', '269158', '269322', '269323', '269326', '269329', '269332', '269333', '269334', '269335', '269340', '269358', '269368', '269372', '269373', '269376', '269382', '269388', '269413', '269416', '270765', '272439', '272441', '272696', '274119', '274120', '274126', '274332', '274475', '274806', '274837', '274999', '275009', '275010', '275084', '275111', '275123', '275124', '275125', '275219', '275237', '275298', '275757', '276016', '276287', '298112', '298114', '298115', '298170', '312576']
    nosy_count = 13.0
    nosy_names = ['gvanrossum', 'terry.reedy', 'jaraco', 'vstinner', 'ezio.melotti', 'r.david.murray', 'python-dev', 'martin.panter', 'serhiy.storchaka', 'ztane', 'jayvdb', 'yan12125', 'abarry']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue27364'
    versions = ['Python 3.6']

    @Vgr255
    Mannequin Author

    Vgr255 mannequin commented Jun 21, 2016

    Attached patch deprecates invalid escape sequences in unicode strings. The point of this is to prevent issues such as bpo-27356 (and possibly other similar ones) in the future.

    Without the patch:

    >>> "hello \world"
    'hello \\world'

    With the patch:

    >>> "hello \world"
    DeprecationWarning: invalid escape sequence 'w'

    I'll need some help (patch isn't mergeable yet):

    test_doctest fails on my machine with the patch (and -W), and I don't know how to fix it. test_ast fails an assertion (!PyErr_Occurred() in PyObject_Call in abstract.c) when -W is on, and I also don't know how to fix it (I don't even know what causes it).

    Of course, I went ahead and fixed all instances of invalid escape sequences in the stdlib (that I could find) so that no DeprecationWarning is encountered.

    Lastly, I thought about also doing this to bytes, but I ran into some issues with some invalid escapes such as \u, and _codecs.escape_decode would trigger the warning when passed br"\8" (for example). Ultimately, I decided to leave bytes alone for now, since it's mostly on the lower-level side of things. If there's interest I can add it back.
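A minimal sketch of how the warning described above can be observed from Python code, assuming an interpreter that already includes this change (3.6 or later). The helper name `invalid_escape_warnings` is hypothetical, and both `DeprecationWarning` and `SyntaxWarning` are accepted because the category is not the same in every release:

```python
import warnings


def invalid_escape_warnings(source):
    """Compile `source` and return the messages of invalid-escape warnings.

    Hypothetical helper for illustration; it is not part of the patch.
    """
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning, including ones that are silent by default.
        warnings.simplefilter("always")
        compile(source, "<example>", "exec")
    return [
        str(w.message)
        for w in caught
        if issubclass(w.category, (DeprecationWarning, SyntaxWarning))
        and "invalid escape sequence" in str(w.message)
    ]


# The source text is held in raw strings so this example is itself clean.
print(invalid_escape_warnings(r'"hello \world"'))   # at least one message
print(invalid_escape_warnings(r'r"hello \world"'))  # raw literal: no messages
```

Passing the source as text to `compile()` keeps the invalid escape out of the example file itself, so the example runs without triggering the warning it demonstrates.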

    @Vgr255 Vgr255 mannequin added interpreter-core (Objects, Python, Grammar, and Parser dirs) stdlib Python modules in the Lib dir topic-unicode type-bug An unexpected behavior, bug, or error labels Jun 21, 2016
    @bitdancer
    Member

    Have you searched the python-dev and python-ideas archives for the previous discussions of this issue? I don't remember for sure, but I think Guido might have made a ruling (not that the discussion couldn't be reopened if he has, but, well...)

    @Vgr255
    Mannequin Author

    Vgr255 mannequin commented Jun 23, 2016

    Now I have! I found nothing on Python-Dev, but apparently it's been discussed on Python-ideas before: https://mail.python.org/pipermail/python-ideas/2015-August/035031.html Guido hasn't participated in that discussion, and most of it was "This will break people's code", with people both for and against the idea, without an apparent consensus.

    Should I try a second round on Python-ideas, to try and get a consensus (or a BDFL ruling)?

    @ztane
    Mannequin

    ztane mannequin commented Jun 23, 2016

    It is handy to be able to use \w and \d in non-raw-string regular expressions without too much backslashitis. This seems to be in use in the Python standard library as well, for example in csv.py.

    @Vgr255
    Mannequin Author

    Vgr255 mannequin commented Jun 24, 2016

    Yes, it's in use in an awful lot of places (see my patch). The proper fix is to use raw strings, or, if you need actual escapes in the same string, manually escape them. However, as you'll see by looking at the patch, the vast majority of cases are fixed by prepending a single 'r' to the front of the string. In fact, only csv.py and html/parser.py needed finer-grained escaping.

    I think that the argument "It works in non-raw strings" is weak. I've always used raw strings for regular expressions, and this patch would simply move this from being a style issue to being a syntax one (and I think it's fine :).
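The two fixes mentioned above (a raw string, or manually doubled backslashes when real escapes are needed in the same literal) can be sketched as follows. The patterns are made-up examples, not taken from the patch:

```python
import re

# Raw string: backslashes reach the re module untouched.
raw_pattern = r"\d+-\w+"

# Equivalent cooked string: every backslash doubled by hand.
escaped_pattern = "\\d+-\\w+"

# Mixed case: a real tab character (\t) sits next to regex escapes, so a raw
# string would change the meaning and the regex backslashes are doubled instead.
mixed_pattern = "\\d+\t\\w+"

assert raw_pattern == escaped_pattern
assert re.fullmatch(raw_pattern, "42-abc") is not None
assert re.fullmatch(mixed_pattern, "42\tabc") is not None
```

The raw form is usually the least error-prone choice for regular expressions; the doubled form is only needed when the same literal must also contain a genuine escape such as a tab or newline.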

    @serhiy-storchaka
    Member

    There was a long discussion on Python-Dev. [1] Guido took part in it.

    [1] http://comments.gmane.org/gmane.comp.python.devel/151612

    @Vgr255
    Mannequin Author

    Vgr255 mannequin commented Jun 24, 2016

    Thanks, I didn't find that one. Apparently Guido's stance is "Make this a silent warning, then we can discuss preventing it later", which happens to be what I'm doing here.

    @Vgr255
    Mannequin Author

    Vgr255 mannequin commented Jun 24, 2016

    I found the cause of the failed assertion: an invalid escape sequence slipped through in a file. Patch attached (also addressing Serhiy's comments).

    It worries me a little though that pure Python code can cause a hard crash. Ok, it worries me a lot. Please don't merge this until it's fixed. I'm guessing this is a combination of unittest catching warnings and compiling the faulty source file. Why a malformed node (i.e. one that raised a DeprecationWarning) managed to pass through unharmed is beyond me.

    @gvanrossum
    Member

    I am okay with making it a silent warning.

    Can we do it in two stages though? It doesn't have to be two releases, I just mean two separate commits: (1) fix all places in the stdlib that violate this principle; (2) separately commit the code that causes the silent deprecation (and tests for it).

    What exactly was the hard crash you got? Do you think it was a bug in your own C code or in existing C code?

    @Vgr255
    Mannequin Author

    Vgr255 mannequin commented Jun 26, 2016

    I originally considered making two different patches, so there you go. deprecate_invalid_escapes_only_1.patch has the deprecation plus a test, and invalid_stdlib_escapes_1.patch fixes all invalid escapes in the stdlib.

    My code was the cause, although not directly; it was 'assert(!PyErr_Occurred())' at the beginning of PyObject_Call in Objects/abstract.c that failed.

    This happened when I ran the whole test suite (although just running test_ast was fine to reproduce it) with the '-W error' command line switch. One stdlib module (I don't remember which one) had one single invalid escape sequence in it, and then test_ast.ASTValidatorTests.test_stdlib_validates triggered the failed assertion. Fixing the invalid escape removes the failure and all tests pass.

    One can reliably reproduce the crash with the patch by adding a string with an invalid escape in any of the stdlib files (and running with '-W error'):

    No invalid sequence:

    >>> import unittest, test.test_ast
    >>> unittest.main(test.test_ast)
    ..............................................................................

    Ran 78 tests in 5.538s

    OK

    With an invalid sequence in a file:

    >>> import unittest, test.test_ast
    >>> unittest.main(test.test_ast)
    ............................................Fatal Python error: a function returned a result with an error set
    DeprecationWarning: invalid escape sequence 'w'
    
    During handling of the above exception, another exception occurred:

    SystemError: <built-in function compile> returned a result with an error set

    Current thread 0x00001ba0 (most recent call first):
    File "E:\GitHub\cpython\lib\ast.py", line 35 in parse
    File "E:\GitHub\cpython\lib\test\test_ast.py", line 944 in test_stdlib_validates
    File "E:\GitHub\cpython\lib\unittest\case.py", line 600 in run
    File "E:\GitHub\cpython\lib\unittest\case.py", line 648 in __call__
    File "E:\GitHub\cpython\lib\unittest\suite.py", line 122 in run
    File "E:\GitHub\cpython\lib\unittest\suite.py", line 84 in __call__
    File "E:\GitHub\cpython\lib\unittest\suite.py", line 122 in run
    File "E:\GitHub\cpython\lib\unittest\suite.py", line 84 in __call__
    File "E:\GitHub\cpython\lib\unittest\runner.py", line 176 in run
    File "E:\GitHub\cpython\lib\unittest\main.py", line 255 in runTests
    File "E:\GitHub\cpython\lib\unittest\main.py", line 94 in __init__
    File "<stdin>", line 1 in <module>

    Then I get the usual "Python has stopped working" Windows prompt (strangely enough, before I'd get a prompt saying "Assertion failed" with the line, but not this time).

    I'm not sure where the error lies exactly. Should I open another issue for that?

    @gvanrossum
    Member

    Hm, if you manage to trigger an assert() in the C code by writing some evil
    Python code, the C code is considered broken (unless it was using ctypes or
    one or two other explicit "void-the-warranty" exceptions).

    Maybe someone who has worked more with the C code recently could help you
    dig into this more; my memory is unreliable when it comes to these details.
    Maybe assert() calls are disabled by default? In general the error "...
    returned a result with an error set" means there's a problem at the C level
    where a function should have either returned an object or returned NULL
    with the per-thread exception state set, but it was found to return an
    object *and* set the exception state. IIRC only debug mode checks for that,
    so such a bug occasionally creeps into the code. But you shouldn't assume
    everything is fine until you've tracked down the cause.

    @Vgr255
    Mannequin Author

    Vgr255 mannequin commented Jun 27, 2016

    Ah right, assert() is only enabled in debug mode, I forgot that. My (very uneducated) guess is that compile() got the error (which was a warning) but then decided to return a value anyway, and the next thing that tries to call anything crashes Python. I opened bpo-27394 to get some experts' advice.

    @Vgr255
    Mannequin Author

    Vgr255 mannequin commented Jun 27, 2016

    Aaand I feel pretty stupid; I didn't check the return value of PyErr_WarnFormat, so it was my mistake. Attached new patch, actually done right this time.
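To illustrate why the unchecked warning mattered: under `-W error` a warning is raised as an exception, so any code that emits one has to be prepared for it to fail. The sketch below is only a rough Python-level analogue, not the C fix itself. `compiles_cleanly` is a hypothetical helper, and the exception actually raised differs between Python versions, hence the broad `except`:

```python
import warnings


def compiles_cleanly(source):
    """Return True if `source` compiles with all warnings promoted to errors.

    Hypothetical helper; it mimics running the compiler under `-W error`.
    """
    with warnings.catch_warnings():
        warnings.simplefilter("error")
        try:
            compile(source, "<example>", "exec")
        except (SyntaxError, Warning):
            # Depending on the version, the promoted warning surfaces either
            # as the warning class itself or as a SyntaxError.
            return False
    return True


print(compiles_cleanly(r'x = "\d"'))   # invalid escape in a cooked literal
print(compiles_cleanly(r'x = r"\d"'))  # the same text as a raw literal
```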

    @vadmium
    Member

    vadmium commented Jun 27, 2016

    Hello Emanuel, I think I have fixed your problem with -Werror, by handling the exception returned by PyErr_WarnFormat() (see my patch). Thanks for separating the actual change from the escape violation fixes; it made it easier to spot the real problem :)

    Also, I like the general idea of the change. It would be good to update the documentation as well (e.g. What’s New, and <https://docs.python.org/3.6/reference/lexical_analysis.html#string-and-bytes-literals>).

    It would be good to do the same for byte string literals, at least to keep things consistent. What did you try so far? Do you have a partial patch for it?

    @vadmium
    Member

    vadmium commented Jun 27, 2016

    Hah, we posted the same fix almost at the same time :)

    @Vgr255
    Mannequin Author

    Vgr255 mannequin commented Jun 27, 2016

    Indeed, we did, thanks for letting me know my mistake :) I didn't get very far into making bytes literals disallow invalid sequences: I ran into issues with _codecs.escape_decode throwing the warning even when the literal was fine, so I stopped there and figured I'd at least post that patch and see if people were interested in extending the change to bytes (turns out they are).

    I forgot about docs, will do so soon, but I'll try to extend the patch for bytes first. I'll see if I can make literals warn but not e.g. _codecs.escape_decode (or anything else, really).

    Thanks!

    @vadmium
    Member

    vadmium commented Jun 27, 2016

    Code samples in the documentation should also be fixed, like at <https://docs.python.org/3.6/library/re.html#re.split>. I think you can run “make -C Doc doctest” or something similar, which may help find some of these.

    Also, playing with your current patch, it seems to affect the “unicode-escape” codec. Not sure if that is a problem, but it probably deserves also documenting the change.
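A small sketch of the codec side effect mentioned above, assuming an interpreter in which the "unicode_escape" codec reports invalid escapes with a `DeprecationWarning` (this may differ in other versions). `decode_with_warnings` is a hypothetical helper name:

```python
import codecs
import warnings


def decode_with_warnings(data):
    """Decode `data` with unicode_escape; return the text and warning categories."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        text = codecs.decode(data, "unicode_escape")
    return text, [w.category for w in caught]


# b"\\d" is the two bytes backslash and "d": not a recognised escape.
text, categories = decode_with_warnings(b"\\d")
print(repr(text), categories)

# b"\\n" is a valid escape and decodes to a newline without any warning.
print(decode_with_warnings(b"\\n"))
```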

    @vstinner
    Member

    Guido: "I am okay with making it a silent warning."

    The current patch raises a DeprecationWarning which is silent by default, but seen using python3 -Wd. What is the "long term" plan: always raise an *exception* in Python 3.7? Which exception?

    Another option is to always emit a SyntaxWarning, but don't raise an exception in long term. It is possible to get an exception using python3 -Werror.

    There is also FutureWarning: "Base class for warnings about constructs that will change semantically in the future" or RuntimeWarning "Base class for warnings about dubious runtime behavior".

    @serhiy-storchaka
    Member

    DeprecationWarning is used when we want to remove a feature. It becomes an error in the future. FutureWarning is used when we want to change the meaning of a feature instead of removing it. For example, re.split(':*', 'a:bc') emits a FutureWarning and returns ['a', 'bc'] because there is a plan to make it return ['', 'a', 'b', 'c', ''].

    I think "a silent warning" means that it should emit a DeprecationWarning or a PendingDeprecationWarning. Since there is no haste, we should use a two-release deprecation period. After that, the deprecation can be changed to a SyntaxWarning in 3.8, and to a UnicodeDecodeError (for strings) and a ValueError (for bytes) in 4.0. The latter are converted to SyntaxError by the parser. In the end we should get the same behavior as for truncated \x and \u escapes.

    >>> '\u'
      File "<stdin>", line 1
    SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape
    >>> b'\x'
      File "<stdin>", line 1
    SyntaxError: (value error) invalid \x escape at position 0

    Maybe change the parser to convert these warnings to a SyntaxWarning?
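The truncated-escape behavior shown above can also be checked programmatically. This is a sketch only; `syntax_error_message` is a hypothetical helper, and the exact message text may vary between versions:

```python
def syntax_error_message(source):
    """Return the SyntaxError message raised when compiling `source`, or None."""
    try:
        compile(source, "<example>", "eval")
    except SyntaxError as exc:
        return exc.msg
    return None


# Truncated \u and \x escapes are already hard errors at compile time.
truncated_str = syntax_error_message(r"'\u'")
truncated_bytes = syntax_error_message(r"b'\x'")
print(truncated_str)
print(truncated_bytes)

# A well-formed literal compiles without error.
print(syntax_error_message("'ok'"))
```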

    @Vgr255
    Mannequin Author

    Vgr255 mannequin commented Jun 27, 2016

    I think ultimately a SyntaxError should be fine. I don't know *when* it becomes appropriate to change a warning into an error; I was thinking 3.7 but, as Serhiy said, there's no rush. I think waiting five release cycles is overkill though, that means the error won't be until 8 years from now (assuming release cycle periods don't change)! I think at most 3.8 should be fine for making this a full-on syntax error.

    @vstinner
    Member

    @ebarry: To move faster, you should also work with linters (pylint, pychecker, pyflakes, pycodestyle, flake8, ...) to log a warning that helps projects prepare for this change. Linters are used on Python 2-only projects, so it will help them prepare for the final Python 3.<n> which will raise an exception.

    @bitdancer
    Member

    Yes, this change is likely to break a lot of code, so an extended deprecation period (certainly longer than 3.7, which Guido has already mandated) is the minimum. Guido hasn't agreed to making it an error yet, as far as I can see ;)

    @gvanrossum
    Member

    I think ultimately it has to become an error (otherwise I wouldn't
    have agreed to the warning, silent or not). But because there's so
    much 3rd party code that depends on it we indeed need to take
    "several" releases before we go there.

    Contacting the PyCQA folks would also be a great idea -- can anyone
    volunteer to do so?

    @Vgr255
    Mannequin Author

    Vgr255 mannequin commented Jun 27, 2016

    Easing transition is always a good idea. I'll contact the PyCQA people later today when I'm back home.

    On second thought, it makes sense to wait more than two release cycles before making this an error. I don't really have a strong opinion on when exactly that should happen.

    @Vgr255
    Mannequin Author

    Vgr255 mannequin commented Jun 28, 2016

    Just brought this to the attention of the code-quality mailing list, so linter maintainers should (hopefully!) catch up soon.

    Also new patch, I forgot to add '\c' in the tests.

    @vadmium
    Member

    vadmium commented Jun 28, 2016

    Forgot to say I reviewed invalid_stdlib_escapes_1.patch the other day and can’t see any problems.

    @gvanrossum
    Member

    +1 on getting this in. Who can help reviewing and merging before beta 1?

    @Vgr255
    Mannequin Author

    Vgr255 mannequin commented Sep 8, 2016

    Thank you R. David for the review, here's a new patch with the one change.

    @serhiy-storchaka
    Member

    I suggest not changing fixcid.py. It is not correct, and there is a separate issue for this (bpo-27952).

    @Vgr255
    Mannequin Author

    Vgr255 mannequin commented Sep 8, 2016

    All right, since you'll work on it I'm leaving it out. Removed it and test_bytes (which you already fixed, thanks!) from new patch.

    @python-dev
    Mannequin

    python-dev mannequin commented Sep 8, 2016

    New changeset b4cc62473c13 by R David Murray in branch 'default':
    bpo-27364: fix "incorrect" uses of escape character in the stdlib.
    https://hg.python.org/cpython/rev/b4cc62473c13

    @bitdancer
    Member

    Here's a copy of Emanuel's deprecation patch with a versionchanged note in the lexical docs and a whatsnew entry.

    @python-dev
    Mannequin

    python-dev mannequin commented Sep 8, 2016

    New changeset 38802c38cfe1 by R David Murray in branch 'default':
    bpo-27364: Deprecate invalid escape strings in str/byutes.
    https://hg.python.org/cpython/rev/38802c38cfe1

    @Vgr255
    Mannequin Author

    Vgr255 mannequin commented Sep 8, 2016

    Thank you David for taking the time to review and commit this :)

    @Vgr255 Vgr255 mannequin closed this as completed Sep 8, 2016
    @bitdancer
    Member

    Thanks Emanuel. No bets on how much hate mail we get for this :)

    @terryjreedy
    Member

    Thank you all for persisting on this. I have seen numerous beginners be puzzled why normal (cooked) strings using '\' for Windows paths sometimes work and sometimes 'mysteriously' do not, as in the initially referenced issue. I also think it better to consistently use 'r' for REs with '\' intended to be passed through to re. (And I pushed some of the IDLE code that was patched.)
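The "sometimes works, sometimes does not" behavior of Windows paths described above can be reproduced with a short sketch. The paths are made-up examples and `cooked` is a hypothetical helper; the literals are evaluated from source text with warnings suppressed so that the example file itself contains no invalid escapes:

```python
import ast
import warnings


def cooked(literal_source):
    """Evaluate a string literal given as source text, ignoring warnings."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        return ast.literal_eval(literal_source)


# \d is not a recognised escape, so the backslash survives and the path "works".
works = cooked(r'"C:\docs"')

# \n is a recognised escape, so the path silently gains a newline character.
breaks = cooked(r'"C:\new"')

print(repr(works))
print(repr(breaks))
```

Writing both paths as raw strings (`r"C:\docs"`, `r"C:\new"`) avoids the difference entirely.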

    @python-dev
    Mannequin

    python-dev mannequin commented Sep 9, 2016

    New changeset 60085c8f01fe by R David Murray in branch 'default':
    bpo-27364: Credit Emanuel Barry in NEWS item.
    https://hg.python.org/cpython/rev/60085c8f01fe

    @python-dev
    Mannequin

    python-dev mannequin commented Sep 9, 2016

    New changeset 98a57845c8cc by Martin Panter in branch 'default':
    Issue bpo-27364: Raw strings to avoid deprecated escaping in com2ann.py
    https://hg.python.org/cpython/rev/98a57845c8cc

    @yan12125
    Mannequin

    yan12125 mannequin commented Sep 11, 2016

    Currently the deprecation message is not so useful when fixing lots of files in a large project. For example, I have two files foo.py and bar.py:

    # foo.py
    import bar
    
    # bar.py
    print('\d')

    It gives:

    $ python3.6 -W error foo.py
    Traceback (most recent call last):
      File "foo.py", line 1, in <module>
        import bar
    DeprecationWarning: invalid escape sequence '\d'

    Things are worse when __import__, imp or importlib are involved. I have to add some code to show which module is being imported.

    It would be better to have at least filenames and line numbers:
    $ ./python -W error foo.py
    Traceback (most recent call last):
      File "foo.py", line 1, in <module>
        import bar
      File "/home/yen/Projects/cpython/build/bar.py", line 1
        print('\d')
             ^
    SyntaxError: (deprecated usage) invalid escape sequence '\d'

    I have a naive patch that prints more information, although raising SyntaxError may not be a good idea.
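As a workaround for locating offending literals, a file's source can be compiled directly and the recorded warnings inspected. This is a sketch under the assumption that the interpreter attaches the compiled file name and line number to the warning, which the 3.6 behavior shown above did not do. `find_invalid_escapes` is a hypothetical helper and the `bar.py` contents are made up:

```python
import warnings


def find_invalid_escapes(source, filename):
    """Return (filename, lineno) pairs for invalid-escape warnings in `source`."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        compile(source, filename, "exec")
    return [
        (w.filename, w.lineno)
        for w in caught
        if "invalid escape sequence" in str(w.message)
    ]


# Stand-in for the contents of bar.py; the offending literal is on line 2.
bar_source = "import os\n" + r"print('\d')" + "\n"
locations = find_invalid_escapes(bar_source, "bar.py")
print(locations)
```

In practice the source would be read from each file in the project, with the real path passed as `filename`.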

    @Vgr255
    Mannequin Author

    Vgr255 mannequin commented Sep 12, 2016

    Fair enough, but please open a new issue for that.

    @terry - you're welcome; that's exactly the reason I pushed for it :)

    @yan12125
    Mannequin

    yan12125 mannequin commented Sep 13, 2016

    Opened a new issue at bpo-28128.

    @jaraco
    Member

    jaraco commented Jul 11, 2017

    One consequence of this change is that now any string that has a backslash needs to be escaped or raw, leading to changes like this one (cherrypy/cherrypy@1d8c03e#diff-be33a4f55d59dfc70fc6452482f3a7a4) where the diagram in the docstring is the culprit. An escaped backslash is not viable in this case, so a raw string is required.

    This particular example strikes me as counter-intuitive, though maybe I just need to adjust my intuition.

    Was the intention for a docstring like above to use raw strings?

    @gvanrossum
    Member

    Yes.

    @serhiy-storchaka
    Member

    Yes, this was the intention. A frequent error is using "\n" in non-raw docstrings. This change doesn't prevent that error, but it increases the chances of catching it when there are other backslashes in the docstring.

    @bitdancer
    Member

    Also note that we have fixed a number of bugs in the stdlib code where a raw string was not used for a docstring when it should have been. And when I say bugs, I mean both formatting problems in pydoc, and doctest bugs. There may even have been a case where it produced a code bug, but I'm not sure I'm recalling that correctly :)

    So yes, requiring that a docstring containing backslashes be marked as a raw string is very intentional.
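A minimal sketch of the docstring guidance above. The functions and the path are made-up examples: a raw docstring keeps every backslash literally, while a cooked docstring needs each one doubled to produce the same text.

```python
def open_log_raw():
    r"""Open the log stored under C:\temp\new\data.log."""


def open_log_escaped():
    """Open the log stored under C:\\temp\\new\\data.log."""


# Both docstrings contain the same text, with no tab or newline sneaking in.
assert open_log_raw.__doc__ == open_log_escaped.__doc__
assert "\t" not in open_log_raw.__doc__
assert "\n" not in open_log_raw.__doc__
```

Without the `r` prefix or the doubling, `\t` and `\n` in this path would silently become a tab and a newline, which is the kind of mistake the deprecation helps surface.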

    @Vgr255
    Mannequin Author

    Vgr255 mannequin commented Feb 22, 2018

    I have created bpo-32912 as a follow-up to this issue for 3.8.
