Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Fix incorrect quoting Link.url #6958

Closed
wants to merge 10 commits into from

Conversation

atugushev
Copy link
Contributor

Fixes #6446.

Copy link
Member

@chrahunt chrahunt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general this looks good, just a few comments.

src/pip/_internal/index.py Outdated Show resolved Hide resolved
# strings in VCS URLs are properly parsed.
path_bits = _safe_chars_re.split(result.path)
path_parts = [urllib_parse.quote(urllib_parse.unquote(path_bits[0]))]
for i in range(1, len(path_bits), 2):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may look cleaner if we operate on iter(_safe_chars_re.split(result.path)). So we would have next(path_bits) instead of path_bits[0] and in the loop we could use the pairwise itertools recipe to get element pairs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid i'm not following how to do it. Could you write a sample snippet?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I was thinking something like this:

path_bits = iter(_safe_chars_re.split(result.path))
path_parts = [urllib_parse.quote(urllib_parse.unquote(next(path_bits)))]
for preserved, to_quote in pairwise(path_bits):
    path_parts.append(preserved)
    path_parts.append(urllib_parse.quote(urllib_parse.unquote(to_quote)))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Due to doc pairwise(s) produces (s0,s1), (s1,s2), (s2, s3), ..., but forloop is expecting (s0,s1), (s2,s3), (s4, s5),.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, you're right - in this case we can use izip(path_bits, path_bits) or wrap it in our own non-overlapping pairwise like

def pairwise(iterable):
    iterable = iter(iterable)
    return izip(iterable, iterable)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it would be better to use a pattern that doesn't require repeating urllib_parse.quote(urllib_parse.unquote()) twice, once before the loop and again during the loop. A couple options for doing that include adding an empty string to make the list have an even number of elements and perhaps using itertools.zip_longest().

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice idea, something like pairwise(chain(path_bits, [''])) would work.

# revision strings in VCS URLs are properly parsed.
path = urllib_parse.quote(urllib_parse.unquote(result.path), safe="/@")
# In addition to the `/` character we protect `@`
# (as well as theirs %-escapes) so that revision
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the %-escapes are for VCS URLs to be properly parsed as much as avoiding side-effects of our preserving them. Like:

  • We preserve @ as-is (in addition to the default /) so that VCS URLs are properly parsed
  • We preserve their %-escapes so they are not inadvertently converted to the literal characters

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you suggest to update the comment? WDYT maybe it's better to set this comment above the _safe_chars_re?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could say: We quote the path for convenience, but VCS URL paths have a plain @ which we should preserve. To avoid double-quoting we unquote then quote, but extract any manually quoted @ and / beforehand so they are not accidentally preserved.

@@ -1313,6 +1313,9 @@ def _get_encoding_from_headers(headers):
return None


_safe_chars_re = re.compile(r'([/@]|%2F|%40)', re.I)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add a comment above, like

# percent-encoded:                   /   @
_safe_chars_re = re.compile(r'([/@]|%2F|%40)', re.I)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I never would've thought of doing this, and this is a good idea! ^>^

path_parts.append(
urllib_parse.quote(urllib_parse.unquote(path_bits[i + 1]))
)
path = ''.join(path_parts)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for jumping in on this later. I think the code inside this else block should definitely be split out into an independently tested function, since it's a bit tricky. (I had a partially finished PR for this issue as well on my machine, using a similar approach. I'd also like to compare with what I have to see if there is anything else I noticed.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, here is the PR I started working on a while ago. It's related but not quite the same: #6496

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the code inside this else block should definitely be split out into an independently tested function, since it's a bit tricky.

That sounds reasonable.

Actually, here is the PR I started working on a while ago. It's related but not quite the same: #6496

I see you have almost finished it. What should we do with the current PR?

Copy link
Member

@cjerdonek cjerdonek Sep 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see you have almost finished it. What should we do with the current PR?

I would just examine the other PR for any differences to see if anything might be missing here and/or worth carrying over. Like, I noticed the other PR also extends the treatment to the file:// case, which I haven't thought recently about to know if it's desired. I think the test cases in the other PR might be something else worth carrying over (at least the applicable ones) since I had put some thought into those.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I think the "revision" portion of the other PR (e.g. the changes to make_vcs_requirement_url()) can be left for that issue / PR, as that issue is distinct from the one this PR is trying to resolve.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May i cherry-pick 2133c1d and b978ff6?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, whatever is easiest for you.. You'll probably want to be making changes on top of that though as I'm guessing you will. For example, I didn't use the regex approach as I was splitting only on one character for that PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll probably want to be making changes on top of that though as I'm guessing you will

Thanks! That's exactly i was going to do :)

@@ -1313,6 +1313,9 @@ def _get_encoding_from_headers(headers):
return None


_safe_chars_re = re.compile(r'([/@]|%2F|%40)', re.I)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that you're calling quote(unquote()) on what's left (and in particular that the return value will wind up quoted anyways), it seems like you shouldn't need to be including the quoted variants in the regex. Really, it should just be splitting on the individual characters you want to preserve as is. You can use safe='' to be sure that / gets quoted in the quote direction.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Given the choice, it's better to let urllib.parse do its job rather than trying to do some of it "by hand." This was part of the motivation for removing the use of regexes in previous PR's that touched _clean_link().)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC if the quoted variants are not preserved then after the initial unquote it would not be possible to distinguish a //@ that was intended to be used as-is (to separate path parts or to separate the main URL from the vcs version) from one that was quoted as part of the credentials.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The quoted and unquoted variants will already be distinguished because that's what the regex will be doing (separating them into two groups). The only ones getting unquoted will be the ones that should be requoted.

Copy link
Contributor Author

@atugushev atugushev Sep 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cjerdonek

Really, it should just be splitting on the individual characters you want to preserve as is. You can use safe='' to be sure that / gets quoted in the quote direction.

You are right! It does work. The only downside is it makes %xx escaping uppercase, for example:

>>> from urllib.parse import quote, unquote
>>> print(quote(unquote("%2F%2f"), safe=""))
%2F%2F

Not sure whether the case-sensitivity is important here.

Copy link
Contributor Author

@atugushev atugushev Sep 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cjerdonek working on PR i found that with local paths (like /C:/path/) there is no way to set safe="":

pathname2url(url2pathname(result.path))

because pathname2url doesn't support it. Well, should we split path by re.compile(r'([/@]|%2F|%40)', re.I) then? Or any other ideas?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suddenly i've figured out that it should be split on re.compile(r'(@|%2F', re.I) and it seems work now!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure that any special characters need to be preserved in the "path" case? Can you provide an example?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the "path" case only '@' needs to be preserved. For the URL paths -- '@' and '%2f'.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @atugushev. Yes, I suspected they are different. (I also had some other edge cases to share, but I no longer need to share those.)

Because the logic is different in at least a couple respects, I think the "path" and "url" cases should be handled by different code, as opposed to e.g. trying to use the same splitting function for both. I would also recommend doing the path case in a separate PR, because there is some added trickiness that would be worth discussing separately. And if the URL case is going to be handled separately from the path case, that would let you go back to leaving the quoted variants out of the regex, as I suggested earlier above.

@atugushev atugushev changed the title Fix incorrect quoting Link.url [WIP] Fix incorrect quoting Link.url Sep 13, 2019
@BrownTruck
Copy link
Contributor

Hello!

I am an automated bot and I have noticed that this pull request is not currently able to be merged. If you are able to either merge the master branch into this pull request or rebase this pull request against master then it will be eligible for code review and hopefully merging!

@pradyunsg
Copy link
Member

@atugushev Would you be able to pick this back up?

@pradyunsg pradyunsg added the S: awaiting response Waiting for a response/more information label Nov 19, 2019
@atugushev
Copy link
Contributor Author

@pradyunsg

Would you be able to pick this back up?

Yeah, sorry for the delay. I'll revisit this PR.

@atugushev
Copy link
Contributor Author

@pradyunsg sorry, have no energy to fix this, mostly cause I don't use windows much. Better to fix it by someone else.

@pradyunsg
Copy link
Member

/cc @uranusjr in case they're interested in exploring this PR. :)

@lock lock bot added the auto-locked Outdated issues that have been locked by automation label Feb 6, 2020
@lock lock bot locked as resolved and limited conversation to collaborators Feb 6, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
auto-locked Outdated issues that have been locked by automation needs rebase or merge PR has conflicts with current master S: awaiting response Waiting for a response/more information
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Quoted / character in URL becomes unquoted
5 participants