-
-
Notifications
You must be signed in to change notification settings - Fork 30.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Different behavior of C and Python impls of datetime.strftime with non-UTF-8-encodable strings #78662
Comments
The C datetime implementation uses PyUnicode_AsUTF8AndSize() in wrap_strftime() and rejects strings containing surrogate code points (0xD800 - 0xDFFF) since they can't be encoded in UTF-8. On the other hand, the pure-Python datetime implementation doesn't have this restriction: >>> import sys
>>> sys.modules['_datetime'] = None # block C implementation
>>> from datetime import time
>>> time().strftime('\ud800')
'\ud800'
>>> del sys.modules['datetime']
>>> del sys.modules['_datetime']
>>> from datetime import time
>>> time().strftime('\ud800')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed |
I'm finding it very difficult to reconcile these things. I'm not entirely sure, but we may need to take a performance hit in normal strftime if we want to make this work with surrogate characters, which really does not appeal to me (though we can certainly improve to some degree). One major question here: Is anyone (@vstinner, @belopolsky?) know why time's strftime opportunistically uses wcsftime instead of strftime? It makes the code *way* more complicated and difficult to read / maintain - are there platforms that provide wcstrftime and not strftime? Also, related, it seems according to https://bugs.python.org/issue10653#msg243660 that there may have been a regression in issue bpo-10653, which may be related here. Either way, some note should probably be made in the code to clarify exactly *why* these choices were made in the code, in case the situation has changed. |
Because caring of surrogates, I would prefer to first fix time.strftime() and time.strptime() for legit time zone names. Currently, Python uses the C function strfime() because the string produced by wcsftime() cannot be parsed later. We should use the native Windows API to get the time zone name properly encoded. I don't recall the details, sorry. There is maybe an open issue for that one. |
Before* |
@p-ganssle
Paul, this issue was only about reconciling C and Python implementations of datetime module -- not fixing time.strftime(), which both of them delegate to. While the latter is certainly more important, it's also much more difficult, and IMO needs another issue report (maybe merged with bpo-34512). As for the former, it seems that you've already fixed it in your PR. |
@izbyshev That's totally fair and I wouldn't want to make it a condition of merging the existing fixes - I've already made great progress in fixing the time.strftime part as well. The main reason it relates here is that I generally find the tests to be among the hardest part about writing a good PR, and if we can't make assertions about the behavior of strftime outputs, I think it makes it hard to prevent regressions. I figured if I can solve the problem all the way down the stack in one go, I might as well. That said, Victor makes an *extremely* good point that this is an outsized effort for the bug it's fixing. No one really *needs* support for unpaired surrogates in their strftime as far as I can tell. The main reason I'm still working on it is that I'm curious to see if it's even possible to fix. |
Ironically, some of the changes we may have to make to fix time.strftime() inconsistencies may appear like regressions to some users. For example, regarding dropping wcsftime() on all platforms, 'man strftime' on macOS contains the following warning in BUGS section: "The strftime() function does not correctly handle multibyte characters in the format argument.". (I'm not sure what "incorrect handling" means here). That's not to discourage you, just to point out that we should be extra careful at this point when there might be users relying on every variety of the current behavior. And that's also why I don't want to mix the fix for C vs. Python issue (low risk) with changes in time.strftime() (high risk). Also, while I certainly agree with "the outsized effort" point, it seems that your PR goes far beyond supporting surrogates because falling back to escaping allows one to bypass checks in wcsftime()/strftime() and round-trip any code point regardless of the current locale. |
I've left a mostly finished PR on bpo-8983, but I've decided it's not really worth continuing to work on. Anyone can feel free to take it and run with it if they want to implement the fix for this. As Alexey mentions, that PR indeed already fixes this bug, I mainly blocked on getting consistent behavior across all platforms. Currently that PR has consistent behavior across all platforms if you disable wcsftime *except* Ubuntu 14.0, which seems to have some kind of funky bug in wcstombs, though for the life of me I don't know how to consistently trigger it. I suspect that using strftime instead of wcsftime on all platforms would lead to simpler code that works most consistently, but I am guessing wcsftime is faster. Another option is to use wsftime but fall back to strftime on MacOS in the event that the error condition (blank output from non-blank input) is detected. |
Fix time.strftime(), the strftime() method and formatting of the datetime classes datetime, date and time. * Characters not encodable in the current locale are now acceptable in the format string. * Surrogate pairs and sequence of surrogatescape-encoded bytes are no longer recombinated. * Embedded null character no longer terminates the format string. This fixes also pythongh-78662 and pythongh-124531.
Fix time.strftime(), the strftime() method and formatting of the datetime classes datetime, date and time. * Characters not encodable in the current locale are now acceptable in the format string. * Surrogate pairs and sequence of surrogatescape-encoded bytes are no longer recombinated. * Embedded null character no longer terminates the format string. This fixes also gh-78662 and gh-124531.
…5193) Fix time.strftime(), the strftime() method and formatting of the datetime classes datetime, date and time. * Characters not encodable in the current locale are now acceptable in the format string. * Surrogate pairs and sequence of surrogatescape-encoded bytes are no longer recombinated. * Embedded null character no longer terminates the format string. This fixes also pythongh-78662 and pythongh-124531. (cherry picked from commit ad3eac1) Co-authored-by: Serhiy Storchaka <[email protected]>
Fix time.strftime(), the strftime() method and formatting of the datetime classes datetime, date and time. * Characters not encodable in the current locale are now acceptable in the format string. * Surrogate pairs and sequence of surrogatescape-encoded bytes are no longer recombinated. * Embedded null character no longer terminates the format string. This fixes also pythongh-78662 and pythongh-124531. (cherry picked from commit ad3eac1)
…5657) Fix time.strftime(), the strftime() method and formatting of the datetime classes datetime, date and time. * Characters not encodable in the current locale are now acceptable in the format string. * Surrogate pairs and sequence of surrogatescape-encoded bytes are no longer recombinated. * Embedded null character no longer terminates the format string. This fixes also gh-78662 and gh-124531. (cherry picked from commit ad3eac1)
…5193) (pythonGH-125657) Fix time.strftime(), the strftime() method and formatting of the datetime classes datetime, date and time. * Characters not encodable in the current locale are now acceptable in the format string. * Surrogate pairs and sequence of surrogatescape-encoded bytes are no longer recombinated. * Embedded null character no longer terminates the format string. This fixes also pythongh-78662 and pythongh-124531. (cherry picked from commit 08ccbb9) Co-authored-by: Serhiy Storchaka <[email protected]> (cherry picked from commit ad3eac1)
…5657) (GH-125661) Fix time.strftime(), the strftime() method and formatting of the datetime classes datetime, date and time. * Characters not encodable in the current locale are now acceptable in the format string. * Surrogate pairs and sequence of surrogatescape-encoded bytes are no longer recombinated. * Embedded null character no longer terminates the format string. This fixes also gh-78662 and gh-124531. (cherry picked from commit 08ccbb9) (cherry picked from commit ad3eac1) Co-authored-by: Serhiy Storchaka <[email protected]>
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
Linked PRs
The text was updated successfully, but these errors were encountered: