urljoin behaves differently with custom and standard schemas #63028

mher · 2013-08-25T12:52:00Z

BPO	18828
Nosy	@orsenthil, @mher, @berkerpeksag, @vadmium, @demianbrecht
Dependencies	bpo-16134: Add support for RTMP schemes to urlparse bpo-23759: urllib.parse: make coap:// known bpo-25895: urllib.parse.urljoin does not handle WebSocket URLs
Files	urljoin-scheme.patch: scheme not checked urljoin-non-hier.patch: join if not in non_hierarchical

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2013-08-25.12:52:00.151>
labels = ['type-bug', 'library']
title = 'urljoin behaves differently with custom and standard schemas'
updated_at = <Date 2016-09-13.04:22:49.352>
user = 'https://github.com/mher'

bugs.python.org fields:

activity = <Date 2016-09-13.04:22:49.352>
actor = 'martin.panter'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation = <Date 2013-08-25.12:52:00.151>
creator = 'mher'
dependencies = ['16134', '23759', '25895']
files = ['37480', '38698']
hgrepos = []
issue_num = 18828
keywords = ['patch']
message_count = 16.0
messages = ['196125', '196230', '196515', '196775', '196794', '196795', '232799', '238363', '238497', '238534', '238536', '239125', '239324', '239341', '239363', '276162']
nosy_count = 6.0
nosy_names = ['orsenthil', 'mher', 'berker.peksag', 'martin.panter', 'demian.brecht', 'madison.may']
pr_nums = []
priority = 'normal'
resolution = None
stage = 'patch review'
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue18828'
versions = ['Python 2.7', 'Python 3.4', 'Python 3.5']

mher · 2013-08-25T12:52:00Z

>>> urljoin('redis://localhost:6379/0', '/1')
'/1'
>>> urljoin('http://localhost:6379/0', '/1')
'http://localhost:6379/1'

MadisonMay · 2013-08-26T17:34:18Z

From urllib.parse:

    uses_relative = ['ftp', 'http', 'gopher', 'nntp', 'imap',
                     'wais', 'file', 'https', 'shttp', 'mms',
                     'prospero', 'rtsp', 'rtspu', '', 'sftp',
                     'svn', 'svn+ssh']

From urllib.parse.urljoin (scheme='redis' and url='/1' in your example):

if scheme != bscheme or scheme not in uses_relative:
    return \_coerce_result(url)

Should the 'redis' scheme be added to uses_relative, perhaps?

vadmium · 2013-08-30T06:08:19Z

Similarly, I expected this to return "rtmp://host/app?auth=token":

urljoin("rtmp://host/app", "?auth=token")

I'm not sure adding everybody's custom scheme to a hard-coded whitelist is the best way to do solve this.

Below I have identified some other schemes not in the "uses_relative" list. Is there any reason why one would use urljoin() with them, but want the base URL to be ignored (as is the current behaviour)? I looked at test_urlparse.py and there doesn't seem to be any test cases for these schemes.

>>> all = set().union(uses_relative, uses_netloc, uses_params, non_hierarchical, uses_query, uses_fragment)
>>> sorted(all.difference(uses_relative))
['git', 'git+ssh', 'hdl', 'mailto', 'news', 'nfs', 'rsync', 'sip', 'sips', 'snews', 'tel', 'telnet']

Even if the behaviour can't be changed, could the documentation for urljoin() say something like this:

Only the following [uses_relative] schemes are allowed in the base URL; any other schemes result in the relative URL being returned without being joined to the base.

berkerpeksag · 2013-09-02T08:59:39Z

How about adding a codecs.register like public API for 3.4+?

    import urllib.parse

    urllib.parse.schemes.register('redis', 'rtmp')

or:

    urllib.parse.urljoin('redis://localhost:6379/0', '/1', scheme='redis')

or just:

    urllib.parse.schemes.extend(['redis', 'rtmp'])

MadisonMay · 2013-09-02T17:39:09Z

If nothing else, we should document the work around for this issue.

>>> import urllib.parse
>>> urllib.parse.uses_relative.append('redis')
>>> urllib.parse.uses_netloc.append('redis')
>>> urllib.parse.urljoin('redis://localhost:6379/0', '/1')
'redis://localhost:6379/1'

MadisonMay · 2013-09-02T17:49:33Z

How about adding a codecs.register like public API for 3.4+?

A codecs style register function seems like an excellent solution to me.

vadmium · 2014-12-17T04:28:13Z

I think a global registry seems like overkill. Here is a patch to make urljoin() treat schemes more equally and work with arbitrary schemes automatically. I haven’t heard any arguments against this option yet, and it didn’t break any tests.

Another option, still simpler than a registry, would be an extra parameter, say urljoin(a, b, any_scheme=True).

demianbrecht · 2015-03-17T23:47:46Z

I haven’t heard any arguments against this option yet, and it didn’t break any tests.

Pre patch:

>>> urljoin('mailto:foo@', 'bar.com')
'bar.com'

Post patch:

>>> urljoin('mailto:foo@', 'bar.com')
'mailto:bar.com/bar.com'

I'm taking an educated guess here based on a marginal amount of research (there are just a few registered schemes at http://www.iana.org/assignments/uri-schemes/uri-schemes.xhtml that should be understood), but it /seems/ like perhaps the current behaviour is intended to safeguard against joining non-hierarchical schemes in which case you'd get nonsensical values. It does seem a little odd to me, but I definitely prefer the former behaviour to the latter.

I think that short term, Madison's suggestion about documenting uses_relative would be an easy win and can be applied to all branches. Long term though, I think it would be nice to have a generalized urljoin() method that accounts for most (if not all) classifications of url schemes.

Thoughts?

vadmium · 2015-03-19T07:25:17Z

I opened bpo-23703 about the funny doubled bar.com result. After backing out revision 901e4e52b20a, but with my patch here applied:

>>> urljoin('mailto:foo@', 'bar.com')
'mailto:bar.com'

which seems fairly sensible to me. A more awkward question is if this behaviour of my patch is reasonable:

>>> urljoin('mailto:person-foo/[email protected]', 'bar.com')
'mailto:person-foo/bar.com'

Yet another option, similar to my “any_scheme=True” flag, might be to change from the “uses_relative” white-list to a “not_relative” black-list of URL schemes, so that urljoin() works for arbitrary schemes except for ones like “mailto:” that are in the hard-coded list.

demianbrecht · 2015-03-19T15:36:23Z

>>> urljoin('mailto:foo@', 'bar.com')
'mailto:bar.com'

which seems fairly sensible to me.

This is where joining arbitrary protocols gets tricky. Does it make sense to merge non-hierarchical protocols such as mailto? My initial reaction is "no" and what should actually happen here is one of two things:

The result is a simple concatenation: "mailto:[email protected]".
An exception is raised indicating that urljoin cannot determine how to handle merging base and url.

The above could happen in cases where either scheme is None for both base and url or the scheme to be used is any of urllib.parse.non_hierarchical.

A more awkward question is if this behaviour of my patch is reasonable:

>>> urljoin('mailto:person-foo/[email protected]', 'bar.com')
'mailto:person-foo/bar.com'

A couple thoughts on this: If urllib.parse.non_hierarchical is used to determine merge vs. simple concat (or exception), this specific case won't be an issue. Also, according to 6068, "mailto:person-foo/[email protected]' is invalid (the "/" should be percent-encoded), but I don't think it should be the job of urljoin to understand the URI structures of each protocol, outside of logically join base and url.

Yet another option, similar to my “any_scheme=True” flag, might be to change from the “uses_relative” white-list to a “not_relative” black-list of URL schemes, so that urljoin() works for arbitrary schemes except for ones like “mailto:” that are in the hard-coded list.

This list may already be present in urllib.parse.non_hierarchical. I also think it's worthwhile to do some further research against http://www.iana.org/assignments/uri-schemes/uri-schemes.xhtml to ensure the list is up to date.

If this path is chosen, I would suggest getting sign off from a couple core devs prior to investing time in this as all changes discussed so far are backwards incompatible.

demianbrecht · 2015-03-19T15:40:13Z

Also, I would suggest still including the doc changes proposed by Madison in all versions prior to 3.5.

berkerpeksag · 2015-03-24T14:18:40Z

Yet another option, similar to my “any_scheme=True” flag, might be to change from the “uses_relative” white-list to a “not_relative” black-list of URL schemes, [...]

I think this looks like a good solution.

vadmium · 2015-03-26T11:26:36Z

The current behaviour when no scheme is present is fairly sensible to me and should not be changed to do string concatenation nor raise an exception:

>>> urljoin("//netloc/old/path", "new/path")
'//netloc/old/new/path'

I am posting urljoin-non-hier.patch as an alternative to my first patch. This one changes urljoin() to work on any URL scheme not in the existing “non_hierarchical” blacklist. I removed the gopher, wais, and imap schemes from the list, and added tel, so that urljoin() continues to treat these special cases as before. Out of the schemes mentioned in the module but missing from uses_relative, I think non_hierarchical now has all those without directory components: hdl, mailto, news, sip, sips, snews, tel, telnet.

However I am still not really convinced that my first urljoin-scheme.patch is a bad idea. Do people actually use urljoin() with these schemes like mailto in the first place?

demianbrecht · 2015-03-26T15:35:09Z

The current behaviour when no scheme is present is fairly sensible to me and should not be changed to do string concatenation nor raise an exception

Agreed. Defaulting to relative behaviour makes sense as I imagine that'll be the general use case.

I removed the gopher, wais, and imap schemes from the list

I'd be concerned about removing items as non_hierarchical /is/ public facing and it's reasonable to assume that there are libraries out there that depend on these. Additionally, at a glance through their respective RFCs, it seems that these three protocols /do/ belong in the non_hierarchical list. While WAIS and IMAP do use / as a delimiter, they're not hierarchical and therefore relative joining doesn't make sense. For example, with the following definition in mind (RFC4156):

wais://<host>:<port>/<database>/<wtype>/<wpath>

The following will result in an incorrect URL:

urljoin('wais://[email protected]/mydb/type/path', '/newpath')

However I am still not really convinced that my first urljoin-scheme.patch is a bad idea. Do people actually use urljoin() with these schemes like mailto in the first place?

I'd be inclined to agree that it's far from common practice. That said, I did find one instance of a project that seems to depend on current behaviour (although it's only in tests and I haven't looked any deeper): https://github.com/chfoo/wpull/blob/32837d7c5614d7f90b8242e1fbb41f8da9bc7ce7/wpull/url_test.py#L637. I imagine that the current behaviour may possibly be useful for utilities such as web crawlers. In light of that and the fact that the urllib.parse docs currently has a list of protocols that are intended to be supported across the entire module's API, I think that it's important to not break backwards compatibility in cases where the relative URL would have been returned. Your second patch seems to have this behaviour which I think is preferable.

vadmium · 2015-03-26T23:09:49Z

If necessary, we can add a new non_relative list, rather than changing non_hierarchical. The repository history shows that “non_hierarchical” was updated with new schemes once or twice, but has never been used since it was added to Python as “urlparse.py”.

IMAP, WAIS and Gopher URLs can have extra components added using slashes, which satisfies my idea of “hierarchical”. For IMAP, I think this is explicitly mentioned in the RFC: <https://tools.ietf.org/html/rfc5092#section-7\>. For WAIS, the hierarchy is not arbitrary, but your resulting URL wais://[email protected]/newpath probably matches the wais://<host>:<port>/<database> URL form, and I am not proposing to change that behaviour.

vadmium · 2016-09-13T04:22:49Z

Recording bugs reports for specific schemes as dependencies of this:

bpo-25895: ws(s)
bpo-16134: rtmp(e/s/t)
bpo-23759: coap(s)

rmorshea · 2023-05-26T17:36:34Z

This is causing problems in Poetry.

orsenthil · 2023-05-26T19:46:37Z

@rmorshea - oh no. This is such a old bug report. Would you explain if it is poetry problem or urlparse?

It just so happens that ssh is not in uses_relative. As a result, the following unexpected behavior arises:

Moving protocols to say different parsing behavior will break libraries and systems which are relying on the current behavior.

rmorshea · 2023-05-26T19:53:44Z

I'm not really sure what the right thing to do here is. With that said, I don't fully understand why svn+ssh is in the list, but not ssh and/or git+ssh - using relative URLs in .gitmodules to describe Git submodules is relatively common.

Thankfully it's possible to work around this issue:

def urljoinpath(base: str, path: str) -> str:
    parsed_base = urlparse(base)
    new = parsed_base._replace(path=urljoin(parsed_base.path, path))
    return urlunparse(new)

But it's odd that this is necessary. Perhaps, the function above would be a useful addition to urllib or perhaps urljoin needs a new parameter to allow for this behavior.

realazthat · 2024-02-23T03:07:18Z

def urljoinpath(base: str, path: str) -> str:
    parsed_base = urlparse(base)
    new = parsed_base._replace(path=urljoin(parsed_base.path, path))
    return urlunparse(new)

What about something like

def SmartURLJoin(base: str, url: str) -> str:
  base_pr = urlparse(base)
  bscheme = base_pr.scheme

  url_pr = urlparse(url)
  scheme = url_pr.scheme or bscheme
  if bscheme != scheme:
    return url

  base_pr = base_pr._replace(scheme='http')
  url_pr = url_pr._replace(scheme='http')

  joined = urljoin(urlunparse(base_pr), urlunparse(url_pr))
  joined_pr = urlparse(joined)
  joined_pr = joined_pr._replace(scheme=scheme)
  return urlunparse(joined_pr)

This seems slightly more in line with urljoin(), because it works with more than paths as the second argument.

scy · 2024-11-06T15:30:06Z

I would've expected urljoin to simply apply the algorithm defined in RFC 3986 section 5.2. That algorithm doesn't require any knowledge about whether a specific scheme is "hierarchical" or not.

urljoin('mailto:foo@', 'bar.com') becomes mailto:bar.com, because the base URL does not define an authority (called "netloc" in urllib.parse) part, because there's no // after the colon. The result does not point to a valid email address, but that's to be expected, because the second argument isn't a valid email address.
Also note that mailto: URLs are not hierarchical with regard to URL parsing: They have two components, a scheme of mailto and a path of [email protected]. urljoin("mailto:foo@", "bar.com"), assuming that urljoin is to be generic, should be interpreted as "keep the mailto scheme, and join the paths foo@ and bar.com". And since @ has no special meaning in a generic parser, merging the paths is simply "remove the last component from the base" and "add the path in the relative URL". (See also section 5.2.3.)
urljoin('mailto:person-foo/[email protected]', 'bar.com') should become mailto:person-foo/bar.com in a generic parser. Because slashes have special meanings in URLs: They delimit path components. @ characters only have a special meaning in the authority part of the URL, but mailto URLs don't have that. Therefore, the path components in the base URL are person-foo and [email protected]. It is not an URL parser's job to understand email addresses.
While slashes are allowed in email address local-parts, they have special meaning in URL paths and would need to be escaped in that mailto: URL.
urljoin('wais://[email protected]/mydb/type/path', '/newpath') should result in wais://[email protected]/newpath. While I understand that this is probably not the intended outcome given the semantics of the paths in WAIS URLs, it is not a generic parser's job to deal with that. If you want to join different <wpath>s to a WAIS URL, but keep the first two path components (<database> and <wtype>) intact, either use a protocol-specific join implementation, or a relative path in the second argument.

As someone who just read RFC 3986 and attempted to use urljoin on his custom, RFC-compliant URL scheme, I was "surprised" (to say the least) of Python's decision to come with magic lists of protocols where urljoin might work correctly and others where it might not, and these lists might probably change between releases. Right now, urllib.parse is not RFC compliant, even though it claims to be.

vadmium · 2024-11-06T21:52:14Z

I think someone needs to make the call if it is okay to break the current behaviour of joining to e.g. “mailto:” and “ssh:” URLs in order to make urljoin the same for any scheme. This is my preferred solution, implemented in urljoin-scheme.patch.

If not okay, the next best option is adding a new non_relative blacklist, which seemed to gather a bit of support. Like urljoin-non-hier.patch, but without editing the existing non_hierarchical list.

I am happy if someone wants to use my patches and bring them up to date with current Python, but I’m unlikely to spend the time myself.

mher mannequin added the type-bug An unexpected behavior, bug, or error label Aug 25, 2013

MadisonMay mannequin added the stdlib Python modules in the Lib dir label Aug 30, 2013

ezio-melotti transferred this issue from another repository Apr 10, 2022

rmorshea mentioned this issue May 26, 2023

Git dependencies' submodules with relative URLs handled incorrectly (regression from 1.1) python-poetry/poetry#6499

Closed

3 tasks

pirate mentioned this issue Apr 25, 2024

a bug of urllib.parse.urljoin ArchiveBox/ArchiveBox#1411

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

urljoin behaves differently with custom and standard schemas #63028

urljoin behaves differently with custom and standard schemas #63028

mher mannequin commented Aug 25, 2013

mher mannequin commented Aug 25, 2013

MadisonMay mannequin commented Aug 26, 2013

vadmium commented Aug 30, 2013

berkerpeksag commented Sep 2, 2013

MadisonMay mannequin commented Sep 2, 2013

MadisonMay mannequin commented Sep 2, 2013

vadmium commented Dec 17, 2014

demianbrecht mannequin commented Mar 17, 2015

vadmium commented Mar 19, 2015

demianbrecht mannequin commented Mar 19, 2015

demianbrecht mannequin commented Mar 19, 2015

berkerpeksag commented Mar 24, 2015

vadmium commented Mar 26, 2015

demianbrecht mannequin commented Mar 26, 2015

vadmium commented Mar 26, 2015

vadmium commented Sep 13, 2016

rmorshea commented May 26, 2023

orsenthil commented May 26, 2023

rmorshea commented May 26, 2023

realazthat commented Feb 23, 2024

scy commented Nov 6, 2024

vadmium commented Nov 6, 2024

urljoin behaves differently with custom and standard schemas #63028

urljoin behaves differently with custom and standard schemas #63028

Comments

mher mannequin commented Aug 25, 2013

mher mannequin commented Aug 25, 2013

MadisonMay mannequin commented Aug 26, 2013

vadmium commented Aug 30, 2013

berkerpeksag commented Sep 2, 2013

MadisonMay mannequin commented Sep 2, 2013

MadisonMay mannequin commented Sep 2, 2013

vadmium commented Dec 17, 2014

demianbrecht mannequin commented Mar 17, 2015

vadmium commented Mar 19, 2015

demianbrecht mannequin commented Mar 19, 2015

demianbrecht mannequin commented Mar 19, 2015

berkerpeksag commented Mar 24, 2015

vadmium commented Mar 26, 2015

demianbrecht mannequin commented Mar 26, 2015

vadmium commented Mar 26, 2015

vadmium commented Sep 13, 2016

rmorshea commented May 26, 2023

orsenthil commented May 26, 2023

rmorshea commented May 26, 2023

realazthat commented Feb 23, 2024

scy commented Nov 6, 2024

vadmium commented Nov 6, 2024