[WIP] Fix incorrect quoting Link.url #6958

atugushev · 2019-08-31T23:10:03Z

chrahunt

In general this looks good, just a few comments.

src/pip/_internal/index.py

chrahunt · 2019-09-01T13:26:44Z

src/pip/_internal/index.py

+        # strings in VCS URLs are properly parsed.
+        path_bits = _safe_chars_re.split(result.path)
+        path_parts = [urllib_parse.quote(urllib_parse.unquote(path_bits[0]))]
+        for i in range(1, len(path_bits), 2):


This may look cleaner if we operate on iter(_safe_chars_re.split(result.path)). So we would have next(path_bits) instead of path_bits[0] and in the loop we could use the pairwise itertools recipe to get element pairs.

I'm afraid I'm not following how to do it. Could you write a sample snippet?

Sure, I was thinking something like this:

    path_bits = iter(_safe_chars_re.split(result.path))
    path_parts = [urllib_parse.quote(urllib_parse.unquote(next(path_bits)))]
    for preserved, to_quote in pairwise(path_bits):
        path_parts.append(preserved)
        path_parts.append(urllib_parse.quote(urllib_parse.unquote(to_quote)))

I see. According to the docs, pairwise(s) produces (s0, s1), (s1, s2), (s2, s3), ..., but the for loop expects (s0, s1), (s2, s3), (s4, s5), ...

Oops, you're right - in this case we can use izip(path_bits, path_bits) or wrap it in our own non-overlapping pairwise like

    def pairwise(iterable):
        iterable = iter(iterable)
        return izip(iterable, iterable)

Also, it would be better to use a pattern that doesn't require repeating urllib_parse.quote(urllib_parse.unquote()) twice, once before the loop and again during the loop. A couple options for doing that include adding an empty string to make the list have an even number of elements and perhaps using itertools.zip_longest().

Nice idea, something like pairwise(chain(path_bits, [''])) would work.
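To make the suggestion above concrete, here is a minimal sketch of a non-overlapping pairwise combined with chain(path_bits, ['']). It is an illustration only, not the code that was merged: clean_path is a hypothetical helper name, and Python 3's zip and urllib.parse stand in for the izip and urllib_parse names used in the thread.

```python
import re
from itertools import chain
from urllib.parse import quote, unquote

# Same pattern as in the diff under review: the reserved characters and
# their percent-encoded forms, captured so that re.split() keeps them.
_safe_chars_re = re.compile(r'([/@]|%2F|%40)', re.I)


def pairwise(iterable):
    """Return non-overlapping pairs: (s0, s1), (s2, s3), ..."""
    iterator = iter(iterable)
    return zip(iterator, iterator)


def clean_path(path):
    # Hypothetical helper name, used only for this illustration.
    parts = _safe_chars_re.split(path)
    # split() with a capturing group returns an odd number of items
    # (text, separator, text, ..., text); the trailing '' evens it out,
    # so quote(unquote(...)) is written only once.
    cleaned = []
    for to_clean, preserved in pairwise(chain(parts, [''])):
        cleaned.append(quote(unquote(to_clean)))
        cleaned.append(preserved)
    return ''.join(cleaned)
```

With this sketch, clean_path('/path with space/repo@v1.0%2Frc') gives '/path%20with%20space/repo@v1.0%2Frc': the space is quoted while '/', '@' and '%2F' pass through untouched.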

chrahunt · 2019-09-01T13:37:59Z

src/pip/_internal/index.py

-        # revision strings in VCS URLs are properly parsed.
-        path = urllib_parse.quote(urllib_parse.unquote(result.path), safe="/@")
+        # In addition to the `/` character we protect `@`
+        # (as well as theirs %-escapes) so that revision


I don't think the %-escapes are for VCS URLs to be properly parsed as much as avoiding side-effects of our preserving them. Like:

We preserve @ as-is (in addition to the default /) so that VCS URLs are properly parsed

We preserve their %-escapes so they are not inadvertently converted to the literal characters

Are you suggesting that I update the comment? WDYT, maybe it would be better to put this comment above _safe_chars_re?

Maybe we could say: We quote the path for convenience, but VCS URL paths have a plain @ which we should preserve. To avoid double-quoting we unquote then quote, but extract any manually quoted @ and / beforehand so they are not accidentally preserved.

chrahunt · 2019-09-01T13:40:41Z

src/pip/_internal/index.py

@@ -1313,6 +1313,9 @@ def _get_encoding_from_headers(headers):
    return None


+_safe_chars_re = re.compile(r'([/@]|%2F|%40)', re.I)


I would add a comment above, like

    # percent-encoded:                   /   @
    _safe_chars_re = re.compile(r'([/@]|%2F|%40)', re.I)

I never would've thought of doing this, and this is a good idea! ^>^
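For readers unfamiliar with the pattern, the following small sketch shows what re.split() returns when the pattern has a capturing group. The input string is made up for illustration; the regex is the one from the diff.

```python
import re

_safe_chars_re = re.compile(r'([/@]|%2F|%40)', re.I)

# Because the pattern is wrapped in a capturing group, the separators
# are kept in the result, alternating with the text between them.
bits = _safe_chars_re.split('/repo%2fname@v1')

# Even indexes hold the text to be unquoted and re-quoted;
# odd indexes hold the separators that are preserved as-is.
text = bits[0::2]
separators = bits[1::2]
```

Here bits is ['', '/', 'repo', '%2f', 'name', '@', 'v1'], which is why the loop in the diff steps through the list two items at a time.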

cjerdonek · 2019-09-08T23:43:12Z

src/pip/_internal/index.py

+            path_parts.append(
+                urllib_parse.quote(urllib_parse.unquote(path_bits[i + 1]))
+            )
+        path = ''.join(path_parts)


Sorry for jumping in on this later. I think the code inside this else block should definitely be split out into an independently tested function, since it's a bit tricky. (I had a partially finished PR for this issue as well on my machine, using a similar approach. I'd also like to compare with what I have to see if there is anything else I noticed.)

Actually, here is the PR I started working on a while ago. It's related but not quite the same: #6496

> I think the code inside this else block should definitely be split out into an independently tested function, since it's a bit tricky.

That sounds reasonable.

> Actually, here is the PR I started working on a while ago. It's related but not quite the same: #6496

I see you have almost finished it. What should we do with the current PR?

> I see you have almost finished it. What should we do with the current PR?

I would just examine the other PR for any differences to see if anything might be missing here and/or worth carrying over. For example, I noticed the other PR also extends the treatment to the file:// case, which I haven't thought about recently enough to know if it's desired. I think the test cases in the other PR might be something else worth carrying over (at least the applicable ones) since I had put some thought into those.

Also, I think the "revision" portion of the other PR (e.g. the changes to make_vcs_requirement_url()) can be left for that issue / PR, as that issue is distinct from the one this PR is trying to resolve.

May I cherry-pick 2133c1d and b978ff6?

Sure, whatever is easiest for you. You'll probably want to be making changes on top of that though as I'm guessing you will. For example, I didn't use the regex approach as I was splitting only on one character for that PR.

> You'll probably want to be making changes on top of that though as I'm guessing you will

Thanks! That's exactly what I was going to do :)

cjerdonek · 2019-09-09T03:19:18Z

src/pip/_internal/index.py

@@ -1313,6 +1313,9 @@ def _get_encoding_from_headers(headers):
    return None


+_safe_chars_re = re.compile(r'([/@]|%2F|%40)', re.I)


Given that you're calling quote(unquote()) on what's left (and in particular that the return value will wind up quoted anyways), it seems like you shouldn't need to be including the quoted variants in the regex. Really, it should just be splitting on the individual characters you want to preserve as is. You can use safe='' to be sure that / gets quoted in the quote direction.

(Given the choice, it's better to let urllib.parse do its job rather than trying to do some of it "by hand." This was part of the motivation for removing the use of regexes in previous PR's that touched _clean_link().)

IIUC, if the quoted variants are not preserved, then after the initial unquote it would not be possible to distinguish a / or @ that was intended to be used as-is (to separate path parts or to separate the main URL from the VCS version) from one that was quoted as part of the credentials.

The quoted and unquoted variants will already be distinguished because that's what the regex will be doing (separating them into two groups). The only ones getting unquoted will be the ones that should be requoted.
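A minimal sketch of the approach described in this exchange, assuming only the literal '/' and '@' are split out and everything else goes through quote(unquote(...), safe=''). The names _reserved_chars_re and clean_url_path_sketch are hypothetical and do not come from pip.

```python
import re
from urllib.parse import quote, unquote

# Hypothetical name: only the literal characters that must survive
# as-is, without their percent-encoded variants.
_reserved_chars_re = re.compile(r'([/@])')


def clean_url_path_sketch(path):
    parts = _reserved_chars_re.split(path)
    cleaned = [
        # Odd indexes are the captured separators, kept untouched.
        # Even indexes are re-quoted with safe='' so that an unquoted
        # '/' is turned back into '%2F' rather than left as a literal.
        part if index % 2 else quote(unquote(part), safe='')
        for index, part in enumerate(parts)
    ]
    return ''.join(cleaned)
```

For the made-up input 'a%2fb@v%401/c d' this returns 'a%2Fb@v%401/c%20d': the quoted variants stay quoted, the literal separators stay literal, and the escape '%2f' comes back in uppercase.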

@cjerdonek

> Really, it should just be splitting on the individual characters you want to preserve as is. You can use safe='' to be sure that / gets quoted in the quote direction.

You are right! It does work. The only downside is it makes %xx escaping uppercase, for example:

    >>> from urllib.parse import quote, unquote
    >>> print(quote(unquote("%2F%2f"), safe=""))
    %2F%2F

Not sure whether the case-sensitivity is important here.

@cjerdonek while working on the PR I found that with local paths (like /C:/path/) there is no way to set safe="":

pathname2url(url2pathname(result.path))

because pathname2url doesn't support it. Should we split the path on re.compile(r'([/@]|%2F|%40)', re.I) then? Or are there any other ideas?

I've just figured out that it should be split on re.compile(r'(@|%2F)', re.I), and it seems to work now!

Are you sure that any special characters need to be preserved in the "path" case? Can you provide an example?

It looks like in the "path" case only '@' needs to be preserved. For URL paths, both '@' and '%2f' need to be preserved.

Thanks, @atugushev. Yes, I suspected they are different. (I also had some other edge cases to share, but I no longer need to share those.)

Because the logic is different in at least a couple respects, I think the "path" and "url" cases should be handled by different code, as opposed to e.g. trying to use the same splitting function for both. I would also recommend doing the path case in a separate PR, because there is some added trickiness that would be worth discussing separately. And if the URL case is going to be handled separately from the path case, that would let you go back to leaving the quoted variants out of the regex, as I suggested earlier above.

BrownTruck · 2019-09-15T02:15:04Z

Hello!

I am an automated bot and I have noticed that this pull request is not currently able to be merged. If you are able to either merge the master branch into this pull request or rebase this pull request against master then it will be eligible for code review and hopefully merging!

pradyunsg · 2019-11-19T21:49:40Z

@atugushev Would you be able to pick this back up?

atugushev · 2019-11-25T18:44:46Z

@pradyunsg

> Would you be able to pick this back up?

Yeah, sorry for the delay. I'll revisit this PR.

atugushev · 2019-12-08T17:46:16Z

@pradyunsg sorry, I have no energy to fix this, mostly because I don't use Windows much. It would be better for someone else to fix it.

pradyunsg · 2020-01-07T08:10:19Z

/cc @uranusjr in case they're interested in exploring this PR. :)

chrahunt reviewed Sep 1, 2019

chrahunt added the type: bugfix label Sep 1, 2019

cjerdonek reviewed Sep 8, 2019

cjerdonek reviewed Sep 9, 2019

atugushev and others added 5 commits September 13, 2019 19:37

Update _clean_link() to support special characters in VCS revision st…

568fd39

…rings Co-Authored-By: Chris Jerdonek <[email protected]>

Add failing test for VCS URL with Windows drive letter and revision

64b7b0f

Co-Authored-By: Chris Jerdonek <[email protected]>

Use is_local_path as parametrized argument

7f3cdbd

Join tests and use pytest.param to skip certain parametrizations

f7c16d0

Add test-cases for '/' to test_clean_url_path

7d40f91

atugushev force-pushed the fix-issue-6446 branch from 942df1f to 5d63070 Compare September 13, 2019 16:43

atugushev and others added 3 commits September 13, 2019 20:05

Add 'pairwise' utility function

898d2c5

Co-Authored-By: Chris Hunt <[email protected]> Co-Authored-By: Chris Jerdonek <[email protected]>

Fix _clean_url_path() unquotes quoted '/' char

d07591e

Add news entry

eb83d99

atugushev force-pushed the fix-issue-6446 branch from acc96f5 to eb83d99 Compare September 13, 2019 17:05

atugushev changed the title ~~Fix incorrect quoting Link.url~~ [WIP] Fix incorrect quoting Link.url Sep 13, 2019

atugushev added 2 commits September 13, 2019 21:05

Fix failing tests on Windows

854fca0

Add test-case with quoted '/' for test_clean_link

441d9c8

BrownTruck added the needs rebase or merge PR has conflicts with current master label Sep 15, 2019

chrahunt mentioned this pull request Oct 13, 2019

[WIP] Support special characters like # and @ in VCS revision strings #6496

Closed

pradyunsg added the S: awaiting response Waiting for a response/more information label Nov 19, 2019

atugushev closed this Dec 8, 2019

chrahunt mentioned this pull request Dec 24, 2019

Support special characters like # in Git branch names #5742

Closed

uranusjr mentioned this pull request Jan 14, 2020

Fix incorrect quoting Link.url #7596

Merged

lock bot added the auto-locked Outdated issues that have been locked by automation label Feb 6, 2020

lock bot locked as resolved and limited conversation to collaborators Feb 6, 2020
