-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Fix incorrect quoting Link.url #6958
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general this looks good, just a few comments.
src/pip/_internal/index.py
Outdated
# strings in VCS URLs are properly parsed. | ||
path_bits = _safe_chars_re.split(result.path) | ||
path_parts = [urllib_parse.quote(urllib_parse.unquote(path_bits[0]))] | ||
for i in range(1, len(path_bits), 2): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may look cleaner if we operate on iter(_safe_chars_re.split(result.path))
. So we would have next(path_bits)
instead of path_bits[0]
and in the loop we could use the pairwise
itertools recipe to get element pairs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm afraid i'm not following how to do it. Could you write a sample snippet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I was thinking something like this:
path_bits = iter(_safe_chars_re.split(result.path))
path_parts = [urllib_parse.quote(urllib_parse.unquote(next(path_bits)))]
for preserved, to_quote in pairwise(path_bits):
path_parts.append(preserved)
path_parts.append(urllib_parse.quote(urllib_parse.unquote(to_quote)))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Due to doc pairwise(s)
produces (s0,s1), (s1,s2), (s2, s3), ...
, but forloop is expecting (s0,s1), (s2,s3), (s4, s5),
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, you're right - in this case we can use izip(path_bits, path_bits)
or wrap it in our own non-overlapping pairwise like
def pairwise(iterable):
iterable = iter(iterable)
return izip(iterable, iterable)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, it would be better to use a pattern that doesn't require repeating urllib_parse.quote(urllib_parse.unquote())
twice, once before the loop and again during the loop. A couple options for doing that include adding an empty string to make the list have an even number of elements and perhaps using itertools.zip_longest()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice idea, something like pairwise(chain(path_bits, ['']))
would work.
src/pip/_internal/index.py
Outdated
# revision strings in VCS URLs are properly parsed. | ||
path = urllib_parse.quote(urllib_parse.unquote(result.path), safe="/@") | ||
# In addition to the `/` character we protect `@` | ||
# (as well as theirs %-escapes) so that revision |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the %-escapes are for VCS URLs to be properly parsed as much as avoiding side-effects of our preserving them. Like:
- We preserve
@
as-is (in addition to the default/
) so that VCS URLs are properly parsed - We preserve their %-escapes so they are not inadvertently converted to the literal characters
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you suggest to update the comment? WDYT maybe it's better to set this comment above the _safe_chars_re
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we could say: We quote the path for convenience, but VCS URL paths have a plain @
which we should preserve. To avoid double-quoting we unquote then quote, but extract any manually quoted @
and /
beforehand so they are not accidentally preserved.
src/pip/_internal/index.py
Outdated
@@ -1313,6 +1313,9 @@ def _get_encoding_from_headers(headers): | |||
return None | |||
|
|||
|
|||
_safe_chars_re = re.compile(r'([/@]|%2F|%40)', re.I) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would add a comment above, like
# percent-encoded: / @
_safe_chars_re = re.compile(r'([/@]|%2F|%40)', re.I)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I never would've thought of doing this, and this is a good idea! ^>^
src/pip/_internal/index.py
Outdated
path_parts.append( | ||
urllib_parse.quote(urllib_parse.unquote(path_bits[i + 1])) | ||
) | ||
path = ''.join(path_parts) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for jumping in on this later. I think the code inside this else
block should definitely be split out into an independently tested function, since it's a bit tricky. (I had a partially finished PR for this issue as well on my machine, using a similar approach. I'd also like to compare with what I have to see if there is anything else I noticed.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, here is the PR I started working on a while ago. It's related but not quite the same: #6496
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the code inside this
else
block should definitely be split out into an independently tested function, since it's a bit tricky.
That sounds reasonable.
Actually, here is the PR I started working on a while ago. It's related but not quite the same: #6496
I see you have almost finished it. What should we do with the current PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see you have almost finished it. What should we do with the current PR?
I would just examine the other PR for any differences to see if anything might be missing here and/or worth carrying over. Like, I noticed the other PR also extends the treatment to the file://
case, which I haven't thought recently about to know if it's desired. I think the test cases in the other PR might be something else worth carrying over (at least the applicable ones) since I had put some thought into those.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I think the "revision" portion of the other PR (e.g. the changes to make_vcs_requirement_url()
) can be left for that issue / PR, as that issue is distinct from the one this PR is trying to resolve.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, whatever is easiest for you.. You'll probably want to be making changes on top of that though as I'm guessing you will. For example, I didn't use the regex approach as I was splitting only on one character for that PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You'll probably want to be making changes on top of that though as I'm guessing you will
Thanks! That's exactly i was going to do :)
src/pip/_internal/index.py
Outdated
@@ -1313,6 +1313,9 @@ def _get_encoding_from_headers(headers): | |||
return None | |||
|
|||
|
|||
_safe_chars_re = re.compile(r'([/@]|%2F|%40)', re.I) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that you're calling quote(unquote())
on what's left (and in particular that the return value will wind up quoted anyways), it seems like you shouldn't need to be including the quoted variants in the regex. Really, it should just be splitting on the individual characters you want to preserve as is. You can use safe=''
to be sure that /
gets quoted in the quote direction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Given the choice, it's better to let urllib.parse
do its job rather than trying to do some of it "by hand." This was part of the motivation for removing the use of regexes in previous PR's that touched _clean_link()
.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC if the quoted variants are not preserved then after the initial unquote it would not be possible to distinguish a /
/@
that was intended to be used as-is (to separate path parts or to separate the main URL from the vcs version) from one that was quoted as part of the credentials.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The quoted and unquoted variants will already be distinguished because that's what the regex will be doing (separating them into two groups). The only ones getting unquoted will be the ones that should be requoted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really, it should just be splitting on the individual characters you want to preserve as is. You can use
safe=''
to be sure that/
gets quoted in the quote direction.
You are right! It does work. The only downside is it makes %xx escaping uppercase, for example:
>>> from urllib.parse import quote, unquote
>>> print(quote(unquote("%2F%2f"), safe=""))
%2F%2F
Not sure whether the case-sensitivity is important here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cjerdonek working on PR i found that with local paths (like /C:/path/) there is no way to set safe=""
:
pathname2url(url2pathname(result.path))
because pathname2url doesn't support it. Well, should we split path by re.compile(r'([/@]|%2F|%40)', re.I)
then? Or any other ideas?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suddenly i've figured out that it should be split on re.compile(r'(@|%2F', re.I)
and it seems work now!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure that any special characters need to be preserved in the "path" case? Can you provide an example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like the "path" case only '@' needs to be preserved. For the URL paths -- '@' and '%2f'.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @atugushev. Yes, I suspected they are different. (I also had some other edge cases to share, but I no longer need to share those.)
Because the logic is different in at least a couple respects, I think the "path" and "url" cases should be handled by different code, as opposed to e.g. trying to use the same splitting function for both. I would also recommend doing the path case in a separate PR, because there is some added trickiness that would be worth discussing separately. And if the URL case is going to be handled separately from the path case, that would let you go back to leaving the quoted variants out of the regex, as I suggested earlier above.
…rings Co-Authored-By: Chris Jerdonek <[email protected]>
Co-Authored-By: Chris Jerdonek <[email protected]>
942df1f
to
5d63070
Compare
Co-Authored-By: Chris Hunt <[email protected]> Co-Authored-By: Chris Jerdonek <[email protected]>
acc96f5
to
eb83d99
Compare
Hello! I am an automated bot and I have noticed that this pull request is not currently able to be merged. If you are able to either merge the |
@atugushev Would you be able to pick this back up? |
Yeah, sorry for the delay. I'll revisit this PR. |
@pradyunsg sorry, have no energy to fix this, mostly cause I don't use windows much. Better to fix it by someone else. |
/cc @uranusjr in case they're interested in exploring this PR. :) |
Fixes #6446.