Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Different behavior of C and Python impls of datetime.strftime with non-UTF-8-encodable strings #78662

Closed
izbyshev mannequin opened this issue Aug 23, 2018 · 8 comments
Closed
Assignees
Labels
3.12 bugs and security fixes 3.13 bugs and security fixes 3.14 new features, bugs and security fixes extension-modules C modules in the Modules dir type-bug An unexpected behavior, bug, or error

Comments

@izbyshev
Copy link
Mannequin

izbyshev mannequin commented Aug 23, 2018

BPO 34481
Nosy @brettcannon, @abalkin, @pitrou, @vstinner, @taleinat, @serhiy-storchaka, @pganssle, @izbyshev
PRs
  • bpo-34481: Fix surrogate-handling in strftime #8983
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2018-08-23.17:40:57.476>
    labels = ['extension-modules', '3.8', 'type-bug', '3.7']
    title = 'Different behavior of C and Python impls of datetime.strftime with non-UTF-8-encodable strings'
    updated_at = <Date 2018-09-05.15:38:31.222>
    user = 'https://github.com/izbyshev'

    bugs.python.org fields:

    activity = <Date 2018-09-05.15:38:31.222>
    actor = 'rhettinger'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Extension Modules']
    creation = <Date 2018-08-23.17:40:57.476>
    creator = 'izbyshev'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 34481
    keywords = ['patch']
    message_count = 8.0
    messages = ['323963', '324329', '324336', '324337', '324356', '324397', '324401', '324630']
    nosy_count = 8.0
    nosy_names = ['brett.cannon', 'belopolsky', 'pitrou', 'vstinner', 'taleinat', 'serhiy.storchaka', 'p-ganssle', 'izbyshev']
    pr_nums = ['8983']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue34481'
    versions = ['Python 3.6', 'Python 3.7', 'Python 3.8']

    Linked PRs

    @izbyshev
    Copy link
    Mannequin Author

    izbyshev mannequin commented Aug 23, 2018

    The C datetime implementation uses PyUnicode_AsUTF8AndSize() in wrap_strftime() and rejects strings containing surrogate code points (0xD800 - 0xDFFF) since they can't be encoded in UTF-8. On the other hand, the pure-Python datetime implementation doesn't have this restriction:

    >>> import sys
    >>> sys.modules['_datetime'] = None # block C implementation
    >>> from datetime import time
    >>> time().strftime('\ud800')
    '\ud800'
    >>> del sys.modules['datetime']
    >>> del sys.modules['_datetime']
    >>> from datetime import time
    >>> time().strftime('\ud800')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed

    @izbyshev izbyshev mannequin added 3.7 (EOL) end of life 3.8 (EOL) end of life extension-modules C modules in the Modules dir type-bug An unexpected behavior, bug, or error labels Aug 23, 2018
    @pganssle
    Copy link
    Member

    I'm finding it very difficult to reconcile these things. I'm not entirely sure, but we may need to take a performance hit in normal strftime if we want to make this work with surrogate characters, which really does not appeal to me (though we can certainly improve to some degree).

    One major question here: Is anyone (@vstinner, @belopolsky?) know why time's strftime opportunistically uses wcsftime instead of strftime? It makes the code *way* more complicated and difficult to read / maintain - are there platforms that provide wcstrftime and not strftime?

    Also, related, it seems according to https://bugs.python.org/issue10653#msg243660 that there may have been a regression in issue bpo-10653, which may be related here.

    Either way, some note should probably be made in the code to clarify exactly *why* these choices were made in the code, in case the situation has changed.

    @vstinner
    Copy link
    Member

    Because caring of surrogates, I would prefer to first fix time.strftime() and time.strptime() for legit time zone names. Currently, Python uses the C function strfime() because the string produced by wcsftime() cannot be parsed later. We should use the native Windows API to get the time zone name properly encoded. I don't recall the details, sorry. There is maybe an open issue for that one.

    @vstinner
    Copy link
    Member

    Because caring of surrogates,

    Before*

    @izbyshev
    Copy link
    Mannequin Author

    izbyshev mannequin commented Aug 29, 2018

    @p-ganssle

    I'm finding it very difficult to reconcile these things.

    Paul, this issue was only about reconciling C and Python implementations of datetime module -- not fixing time.strftime(), which both of them delegate to. While the latter is certainly more important, it's also much more difficult, and IMO needs another issue report (maybe merged with bpo-34512). As for the former, it seems that you've already fixed it in your PR.

    @pganssle
    Copy link
    Member

    @izbyshev That's totally fair and I wouldn't want to make it a condition of merging the existing fixes - I've already made great progress in fixing the time.strftime part as well.

    The main reason it relates here is that I generally find the tests to be among the hardest part about writing a good PR, and if we can't make assertions about the behavior of strftime outputs, I think it makes it hard to prevent regressions. I figured if I can solve the problem all the way down the stack in one go, I might as well.

    That said, Victor makes an *extremely* good point that this is an outsized effort for the bug it's fixing. No one really *needs* support for unpaired surrogates in their strftime as far as I can tell. The main reason I'm still working on it is that I'm curious to see if it's even possible to fix.

    @izbyshev
    Copy link
    Mannequin Author

    izbyshev mannequin commented Aug 30, 2018

    if we can't make assertions about the behavior of strftime outputs, I think it makes it hard to prevent regressions.

    Ironically, some of the changes we may have to make to fix time.strftime() inconsistencies may appear like regressions to some users. For example, regarding dropping wcsftime() on all platforms, 'man strftime' on macOS contains the following warning in BUGS section: "The strftime() function does not correctly handle multibyte characters in the format argument.". (I'm not sure what "incorrect handling" means here). That's not to discourage you, just to point out that we should be extra careful at this point when there might be users relying on every variety of the current behavior. And that's also why I don't want to mix the fix for C vs. Python issue (low risk) with changes in time.strftime() (high risk).

    Also, while I certainly agree with "the outsized effort" point, it seems that your PR goes far beyond supporting surrogates because falling back to escaping allows one to bypass checks in wcsftime()/strftime() and round-trip any code point regardless of the current locale.

    @pganssle
    Copy link
    Member

    pganssle commented Sep 5, 2018

    I've left a mostly finished PR on bpo-8983, but I've decided it's not really worth continuing to work on. Anyone can feel free to take it and run with it if they want to implement the fix for this.

    As Alexey mentions, that PR indeed already fixes this bug, I mainly blocked on getting consistent behavior across all platforms. Currently that PR has consistent behavior across all platforms if you disable wcsftime *except* Ubuntu 14.0, which seems to have some kind of funky bug in wcstombs, though for the life of me I don't know how to consistently trigger it. I suspect that using strftime instead of wcsftime on all platforms would lead to simpler code that works most consistently, but I am guessing wcsftime is faster.

    Another option is to use wsftime but fall back to strftime on MacOS in the event that the error condition (blank output from non-blank input) is detected.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @serhiy-storchaka serhiy-storchaka self-assigned this Oct 5, 2024
    @serhiy-storchaka serhiy-storchaka added 3.12 bugs and security fixes 3.13 bugs and security fixes 3.14 new features, bugs and security fixes and removed 3.8 (EOL) end of life 3.7 (EOL) end of life labels Oct 5, 2024
    serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue Oct 9, 2024
    Fix time.strftime(), the strftime() method and formatting of the
    datetime classes datetime, date and time.
    
    * Characters not encodable in the current locale are now acceptable in
      the format string.
    * Surrogate pairs and sequence of surrogatescape-encoded bytes are no
      longer recombinated.
    * Embedded null character no longer terminates the format string.
    
    This fixes also pythongh-78662 and pythongh-124531.
    serhiy-storchaka added a commit that referenced this issue Oct 17, 2024
    Fix time.strftime(), the strftime() method and formatting of the
    datetime classes datetime, date and time.
    
    * Characters not encodable in the current locale are now acceptable in
      the format string.
    * Surrogate pairs and sequence of surrogatescape-encoded bytes are no
      longer recombinated.
    * Embedded null character no longer terminates the format string.
    
    This fixes also gh-78662 and gh-124531.
    serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue Oct 17, 2024
    …5193)
    
    Fix time.strftime(), the strftime() method and formatting of the
    datetime classes datetime, date and time.
    
    * Characters not encodable in the current locale are now acceptable in
      the format string.
    * Surrogate pairs and sequence of surrogatescape-encoded bytes are no
      longer recombinated.
    * Embedded null character no longer terminates the format string.
    
    This fixes also pythongh-78662 and pythongh-124531.
    (cherry picked from commit ad3eac1)
    
    Co-authored-by: Serhiy Storchaka <[email protected]>
    serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue Oct 17, 2024
    Fix time.strftime(), the strftime() method and formatting of the
    datetime classes datetime, date and time.
    
    * Characters not encodable in the current locale are now acceptable in
      the format string.
    * Surrogate pairs and sequence of surrogatescape-encoded bytes are no
      longer recombinated.
    * Embedded null character no longer terminates the format string.
    
    This fixes also pythongh-78662 and pythongh-124531.
    
    (cherry picked from commit ad3eac1)
    serhiy-storchaka added a commit that referenced this issue Oct 17, 2024
    …5657)
    
    Fix time.strftime(), the strftime() method and formatting of the
    datetime classes datetime, date and time.
    
    * Characters not encodable in the current locale are now acceptable in
      the format string.
    * Surrogate pairs and sequence of surrogatescape-encoded bytes are no
      longer recombinated.
    * Embedded null character no longer terminates the format string.
    
    This fixes also gh-78662 and gh-124531.
    
    (cherry picked from commit ad3eac1)
    miss-islington pushed a commit to miss-islington/cpython that referenced this issue Oct 17, 2024
    …5193) (pythonGH-125657)
    
    Fix time.strftime(), the strftime() method and formatting of the
    datetime classes datetime, date and time.
    
    * Characters not encodable in the current locale are now acceptable in
      the format string.
    * Surrogate pairs and sequence of surrogatescape-encoded bytes are no
      longer recombinated.
    * Embedded null character no longer terminates the format string.
    
    This fixes also pythongh-78662 and pythongh-124531.
    
    (cherry picked from commit 08ccbb9)
    
    Co-authored-by: Serhiy Storchaka <[email protected]>
    (cherry picked from commit ad3eac1)
    serhiy-storchaka added a commit that referenced this issue Oct 17, 2024
    …5657) (GH-125661)
    
    Fix time.strftime(), the strftime() method and formatting of the
    datetime classes datetime, date and time.
    
    * Characters not encodable in the current locale are now acceptable in
      the format string.
    * Surrogate pairs and sequence of surrogatescape-encoded bytes are no
      longer recombinated.
    * Embedded null character no longer terminates the format string.
    
    This fixes also gh-78662 and gh-124531.
    
    (cherry picked from commit 08ccbb9)
    (cherry picked from commit ad3eac1)
    
    Co-authored-by: Serhiy Storchaka <[email protected]>
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.12 bugs and security fixes 3.13 bugs and security fixes 3.14 new features, bugs and security fixes extension-modules C modules in the Modules dir type-bug An unexpected behavior, bug, or error
    Projects
    Archived in project
    Development

    No branches or pull requests

    3 participants