Reorder requirements file decoding #12795
base: main
a3f1cac
to
aa0f744
Compare
aa0f744
to
7df3500
Compare
Ah, this probably needs a proper news entry
Yes, please.
src/pip/_internal/req/req_file.py
Outdated
warnings.warn(
    f"unable to decode data with {exc.encoding}, falling back to {fallback_encoding}",  # noqa: E501
    UnicodeWarning,
    stacklevel=2,
)
content = raw_content.decode(fallback_encoding)
It would be ideal to include filename or filepath of the requirements file.
It would be ideal to include filename or filepath of the requirements file.
Agreed, cae26c0
src/pip/_internal/req/req_file.py
Outdated
warnings.warn(
    f"unable to decode data with {exc.encoding}, falling back to {fallback_encoding}",  # noqa: E501
    UnicodeWarning,
    stacklevel=2,
)
The problem with using `warnings.warn` is that its presentation format is inappropriately technical. `logger.warning` should be used instead.
I think I just went with this because I knew `UnicodeWarning` was a thing. Happy to go with logging: cae26c0
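As a minimal sketch of what the two review suggestions above amount to (log through `logger.warning` rather than `warnings.warn`, and name the requirements file in the message), something like the following could work. The function name `decode_with_fallback` and its parameters are hypothetical and are not taken from the pip source.

```python
import logging

logger = logging.getLogger(__name__)


def decode_with_fallback(
    raw_content: bytes,
    filename: str,
    encoding: str,
    fallback_encoding: str,
) -> str:
    """Hypothetical helper: try `encoding`, then fall back with a logged warning."""
    try:
        return raw_content.decode(encoding)
    except UnicodeDecodeError as exc:
        # logger.warning produces a user-facing message, without the
        # category/stack-location formatting that warnings.warn adds.
        logger.warning(
            "unable to decode data from %s with encoding %s, "
            "falling back to encoding %s",
            filename,
            exc.encoding,
            fallback_encoding,
        )
        return raw_content.decode(fallback_encoding)
```

Passing the values as logging arguments, rather than pre-formatting with an f-string, defers the string formatting until a handler actually emits the record.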
7df3500
to
b4c3255
Compare
src/pip/_internal/req/req_file.py
Outdated
    exc.encoding,
    fallback_encoding,
)
content = raw_content.decode(fallback_encoding)
It may be a good idea to use `errors="backslashreplace"` here. Most of the time, the offending bytes would just be part of a comment anyway and would not make a difference.
It may be a good idea to use `errors="backslashreplace"` here. Most of the time, the offending bytes would just be part of a comment anyway and would not make a difference.
I've been hesitating with this a bit. Specifically, I'm wondering if this could be abused for nefarious purposes, where the contents of the file you 'see' (not well defined, since this is the case where the data won't fully decode) aren't the same contents that pip will process. Though I'm having a hard time finding a vulnerable use-case (something like injecting an extra element or adding a `.` to a domain name in a requirement URL).
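To make the trade-off under discussion concrete, here is a small sketch of how the standard `bytes.decode` error handler `backslashreplace` behaves. The helper name `decode_lenient` and the sample requirement line are illustrative assumptions only.

```python
def decode_lenient(raw: bytes, encoding: str = "utf-8") -> str:
    """Hypothetical helper: decode, replacing undecodable bytes with \\xNN escapes."""
    return raw.decode(encoding, errors="backslashreplace")


# A Latin-1 encoded "é" (0xE9) inside a comment is not valid UTF-8.
raw = b"requests==2.32.0  # caf\xe9\n"

# Strict decoding raises UnicodeDecodeError on this input; the lenient
# version keeps the line and substitutes a literal backslash escape.
text = decode_lenient(raw)
```

The resulting text is not byte-for-byte what an editor using another encoding would display, which is the mismatch between what the user 'sees' and what pip processes that the reply above is concerned about.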
Hmm, since the documentation says it's utf-8 unless there is a PEP-263 style comment, shouldn't we rather decode as utf8 if there is no such comment, and if that fails, fallback to the current That way we have a (more or less) non-breaking path to compliance with the docs? Also, I'd put all that in auto_decode, with a docstring comment that the function is meant for requirements.txt decoding as per the docs.
This sounds reasonable, though I think this change would need to be made in
Sounds good. I've removed this from the 24.3 milestone. Feel free to ping me when you get back to this.
This changes the decoding process to be more in line with what was previously documented. The new process is outlined in the updated docs.

The `auto_decode` function was removed and all decoding logic moved to the `pip._internal.req.req_file` module because:

* This function was only ever used to decode requirements files.
* It was never really a generic 'util' function; it was always tied to the idiosyncrasies of decoding requirements files.
* The module lived under `_internal`, so I felt comfortable removing it.

A warning was added when we _do_ fall back to using the locale-defined encoding, to encourage users to move to an explicit encoding definition via a coding style comment.

This fixes two existing bugs. Firstly, when:

* a requirements file is encoded as UTF-8, and
* some bytes in the file are incompatible with the system locale

Previously, assuming no BOM or PEP-263 style comment, we would default to using the encoding from the system locale, which would then fail (see issue pypa#12771).

Secondly, when decoding a file starting with a UTF-32 little endian Byte Order Marker. Previously this would always fail since `codecs.BOM_UTF32_LE` is `codecs.BOM_UTF16_LE` followed by two null bytes, and because of the ordering of the list of BOMs the UTF-16 case would be run first and match the file prefix, so we would incorrectly deduce that the file was UTF-16 little endian encoded. I can't imagine this is a popular encoding for a requirements file.

Fixes: pypa#12771
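The decoding order implied by this description and the earlier review discussion (BOM first, then a PEP-263 style comment, then UTF-8, and only then the locale encoding with a warning) can be sketched roughly as follows. This is an illustration under those assumptions, not the code from the PR: `decode_requirements`, `BOMS`, and `CODING_RE` are hypothetical names, and the regular expression is a simplified form of the PEP 263 pattern.

```python
import codecs
import locale
import logging
import re

logger = logging.getLogger(__name__)

# Longer BOMs are listed before their shorter prefixes (UTF-32 before UTF-16).
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# Simplified PEP 263 pattern; only applied to comment lines below.
CODING_RE = re.compile(rb"coding[:=]\s*([-\w.]+)")


def decode_requirements(data: bytes, filename: str) -> str:
    """Hypothetical sketch of the decoding order described above."""
    # 1. An explicit byte order mark wins.
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(encoding)
    # 2. A PEP-263 style coding comment in the first two lines.
    for line in data.split(b"\n")[:2]:
        if line.lstrip().startswith(b"#"):
            match = CODING_RE.search(line)
            if match:
                return data.decode(match.group(1).decode("ascii"))
    # 3. Default to UTF-8, falling back to the locale encoding with a warning.
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        fallback = locale.getpreferredencoding(False)
        logger.warning(
            "unable to decode data from %s with encoding utf-8, "
            "falling back to encoding %s",
            filename,
            fallback,
        )
        return data.decode(fallback)
```

Trying UTF-8 before the locale encoding is what addresses the first bug: a UTF-8 file no longer depends on the system locale being able to represent its bytes.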
b4c3255
to
d0bf895
Compare
👍 I've updated the change and the title and description. It was basically a re-do, so I just stomped my previous commits. Per the description: I found and fixed another bug while testing this: requirements files starting with a UTF-32 LE BOM would always be decoded as UTF-16 LE.
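The UTF-32 LE detection bug can be reproduced in isolation with the standard `codecs` constants. The two orderings and the `detect_bom` helper below are illustrative, not pip's actual tables.

```python
import codecs
from typing import List, Optional, Tuple

# Checking UTF-16 LE first: its BOM is a prefix of the UTF-32 LE BOM.
NAIVE_ORDER = [
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
]

# Checking the longer UTF-32 LE BOM first avoids the false match.
FIXED_ORDER = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def detect_bom(data: bytes, boms: List[Tuple[bytes, str]]) -> Optional[str]:
    """Return the encoding of the first BOM in `boms` that prefixes `data`."""
    for bom, encoding in boms:
        if data.startswith(bom):
            return encoding
    return None


sample = codecs.BOM_UTF32_LE + "pip".encode("utf-32-le")
naive_guess = detect_bom(sample, NAIVE_ORDER)  # picks the UTF-16 LE entry
fixed_guess = detect_bom(sample, FIXED_ORDER)  # picks the UTF-32 LE entry
```

Any prefix-based detection table needs the same care: entries that are prefixes of other entries have to come after them.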