
Rethink how .ext formats (vs. ?_format=) work before 1.0 #1439

Closed
simonw opened this issue Aug 17, 2021 · 48 comments


simonw commented Aug 17, 2021

Datasette currently has surprising special behaviour when a table name ends in .csv - which can happen when a tool like csvs-to-sqlite creates tables named after the files they were imported from.

https://latest.datasette.io/fixtures/table%2Fwith%2Fslashes.csv illustrates this behaviour: its .csv and .json export links use ?_format= in the query string rather than a path extension.

Where normally Datasette would add the .csv or .json extension to the path component of the URL (as seen on other pages such as https://latest.datasette.io/fixtures/facet_cities), here the path_with_format() function notices that there is already a . in the path and adds ?_format=csv to the query string instead.

The problem with this mechanism is that it's pretty surprising. Anyone writing external code against Datasette who wants to get back the .csv or .json version, given the URL to a table page, will need to know about and implement this behaviour themselves. That's likely to cause all kinds of bugs in the future.
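A minimal sketch of the behaviour described above - the real path_with_format() helper in Datasette has more logic, and the signature here is an assumption for illustration:

```python
def path_with_format(path, fmt):
    # If the final path component already contains a ".", appending ".csv"
    # would be ambiguous, so fall back to the ?_format= query string.
    if "." in path.rsplit("/", 1)[-1]:
        return path + "?_format=" + fmt
    return path + "." + fmt

print(path_with_format("/fixtures/facet_cities", "csv"))
# /fixtures/facet_cities.csv
print(path_with_format("/fixtures/table.csv", "json"))
# /fixtures/table.csv?_format=json
```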


simonw commented Aug 17, 2021

The challenge comes down to telling the difference between the following:

  • /db/table - an HTML table page
  • /db/table.csv - the CSV version of /db/table
  • /db/table.csv - no this one is actually a database table called table.csv
  • /db/table.csv.csv - the CSV version of /db/table.csv
  • /db/table.csv.csv.csv and so on...
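The ambiguity shows up immediately with a naive parser - illustrative code only, not Datasette's actual routing:

```python
def naive_parse(path):
    # Treat a trailing .csv/.json as an output format. This cannot tell
    # "the CSV version of table" apart from "a table literally named table.csv".
    for fmt in ("csv", "json"):
        if path.endswith("." + fmt):
            return path[: -(len(fmt) + 1)], fmt
    return path, None

print(naive_parse("/db/table.csv"))
# ('/db/table', 'csv') - but which of the two meanings above did the user want?
```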

simonw added a commit to simonw/til that referenced this issue Aug 17, 2021

simonw commented Aug 17, 2021

An alternative solution would be to use some form of escaping for the characters that form the name of the table.

The obvious way to do this would be URL-encoding - but it doesn't hold up for . characters. The hex for . is %2E, but watch what happens with that in a URL:

# Against Cloud Run:
curl -s 'https://datasette.io/-/asgi-scope/foo/bar%2Fbaz%2E' | rg path
 'path': '/-/asgi-scope/foo/bar/baz.',
 'raw_path': b'/-/asgi-scope/foo/bar%2Fbaz.',
 'root_path': '',
# Against Vercel:
curl -s 'https://til.simonwillison.net/-/asgi-scope/foo/bar%2Fbaz%2E' | rg path
 'path': '/-/asgi-scope/foo/bar%2Fbaz%2E',
 'raw_path': b'/-/asgi-scope/foo/bar%2Fbaz%2E',
 'root_path': '',

Surprisingly, in this case Vercel DOES keep it intact, but Cloud Run does not.

It's still no good though: I need a solution that works on Vercel, Cloud Run and every other potential hosting provider too.


simonw commented Aug 18, 2021

But... what if I invent my own escaping scheme?

I actually did this once before, in 9fdb47c - while I was working on porting Datasette to ASGI in #272 (comment) because ASGI didn't yet have the raw_path mechanism.

I could bring that back - it looked like this:

    "table/and/slashes" => "tableU+002FandU+002Fslashes"
    "~table" => "U+007Etable"
    "+bobcats!" => "U+002Bbobcats!"
    "U+007Etable" => "UU+002B007Etable"

But I didn't particularly like it - it was quite verbose.
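A hedged reconstruction of that scheme - the exact set of characters escaped in 9fdb47c may have differed:

```python
# Characters assumed to need escaping; "+" must be included so that a
# literal "U+" in a name round-trips (it becomes "UU+002B...").
SPECIAL = set("/~+")

def uplus_encode(s):
    return "".join("U+%04X" % ord(c) if c in SPECIAL else c for c in s)

print(uplus_encode("table/and/slashes"))  # tableU+002FandU+002Fslashes
print(uplus_encode("U+007Etable"))        # UU+002B007Etable
```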


simonw commented Aug 18, 2021

Here's an alternative I just made up which I'm calling "dot dash" encoding:

def dot_dash_encode(s):
    return s.replace("-", "--").replace(".", "-.")

def dot_dash_decode(s):
    return s.replace("-.", ".").replace("--", "-")

And some examples:

for example in (
  "hello",
  "hello.csv",
  "hello-and-so-on.csv",
  "hello-.csv",
  "hello--and--so--on-.csv",
  "hello.csv.",
  "hello.csv.-",
  "hello.csv.--",
):
    print(example)
    print(dot_dash_encode(example))
    print(example == dot_dash_decode(dot_dash_encode(example)))
    print()

Outputs:

hello
hello
True

hello.csv
hello-.csv
True

hello-and-so-on.csv
hello--and--so--on-.csv
True

hello-.csv
hello---.csv
True

hello--and--so--on-.csv
hello----and----so----on---.csv
True

hello.csv.
hello-.csv-.
True

hello.csv.-
hello-.csv-.--
True

hello.csv.--
hello-.csv-.----
True


simonw commented Aug 18, 2021

So given the original examples, a table called table.csv would have the following URLs:

  • /db/table-.csv - the HTML version
  • /db/table-.csv.csv - the CSV version
  • /db/table-.csv.json - the JSON version

And if for some horrific reason you had a table named /db/table-.csv.csv (so /db/ was part of the actual table name in SQLite), the URLs would look like this:

  • /db/%2Fdb%2Ftable---.csv-.csv - the HTML version
  • /db/%2Fdb%2Ftable---.csv-.csv.csv - the CSV version
  • /db/%2Fdb%2Ftable---.csv-.csv.json - the JSON version
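Those paths can be derived mechanically with the dot_dash_encode() function above:

```python
def dot_dash_encode(s):
    # Escape "-" first, then mark real dots with a "-." prefix.
    return s.replace("-", "--").replace(".", "-.")

print("/db/" + dot_dash_encode("table.csv"))
# /db/table-.csv
print(dot_dash_encode("/db/table-.csv.csv"))
# /db/table---.csv-.csv  (the slashes would still need percent-encoding)
```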


simonw commented Aug 18, 2021

The documentation should definitely cover how table names become URLs, in case any third party code needs to be able to calculate this themselves.


simonw commented Aug 18, 2021

Maybe I should use -/ to encode forward slashes too, to defend against any ASGI servers that might not implement raw_path correctly.

@simonw simonw added this to the Datasette 1.0 milestone Dec 15, 2021

simonw commented Feb 7, 2022

I added a Link header to solve this problem for the JSON version.


simonw commented Feb 18, 2022

Maybe I should use -/ to encode forward slashes too, to defend against any ASGI servers that might not implement raw_path correctly.

def dash_encode(s):
    return s.replace("-", "--").replace(".", "-.").replace("/", "-/")

def dash_decode(s):
    return s.replace("-/", "/").replace("-.", ".").replace("--", "-")
>>> dash_encode("foo/bar/baz.csv")
'foo-/bar-/baz-.csv'
>>> dash_decode('foo-/bar-/baz-.csv')
'foo/bar/baz.csv'


simonw commented Feb 18, 2022

(If I make this change it may break some existing Datasette installations when they upgrade - I could try and build a plugin for them which triggers on 404s and checks to see if the old format would return a 200 response, then returns that.)


simonw commented Feb 18, 2022

How does URL routing for https://latest.datasette.io/fixtures/table%2Fwith%2Fslashes.csv work?

Right now it's

datasette/datasette/app.py

Lines 1098 to 1101 in 7d24fd4

add_route(
    TableView.as_view(self),
    r"/(?P<db_name>[^/]+)/(?P<table_and_format>[^/]+?$)",
)

That's not going to capture the dot-dash encoding version of that table name:

>>> dot_dash_encode("table/with/slashes.csv")
'table-/with-/slashes-.csv'

Probably needs a fancy regex trick like a negative lookbehind assertion or similar.


simonw commented Feb 18, 2022

I want a match for this URL:

/db/table-/with-/slashes-.csv

Maybe this:

^/(?P<db_name>[^/]+)/(?P<table_and_format>([^/]*|(\-/)*|(\-\.)*|(\-\-)*)*$)

Here we are matching a sequence of:

([^/]*|(\-/)*|(\-\.)*|(\-\-)*)*

So a combination of not-slash characters OR -/ or -. or -- sequences.


^/(?P<db_name>[^/]+)/(?P<table_and_format>([^/]*|(\-/)*|(\-\.)*|(\-\-)*)*$)

Try that with non-capturing bits:

^/(?P<db_name>[^/]+)/(?P<table_and_format>(?:[^/]*|(?:\-/)*|(?:\-\.)*|(?:\-\-)*)*$)

(?:[^/]*|(?:\-/)*|(?:\-\.)*|(?:\-\-)*)* visualized is:

[railroad diagram of the regex]

Here's the explanation on regex101.com https://regex101.com/r/CPnsIO/1



simonw commented Feb 18, 2022

With this new pattern I could probably extract out the optional .json format string as part of the initial route capturing regex too, rather than the current table_and_format hack.


simonw commented Feb 18, 2022

I think I got format extraction working! https://regex101.com/r/A0bW1D/1

^/(?P<database>[^/]+)/(?P<table>(?:[^\/\-\.]*|(?:\-/)*|(?:\-\.)*|(?:\-\-)*)*?)(?:(?<!\-)\.(?P<format>\w+))?$

I had to make that crazy inner group even more complicated to stop it from capturing a . that was not part of a -. sequence.

(?:[^\/\-\.]*|(?:\-/)*|(?:\-\.)*|(?:\-\-)*)*

Visualized:

[railroad diagram of the regex]

So now I have a regex which can extract out the dot-encoded table name AND spot if there is an optional .format at the end:

[regex101 screenshot showing the match]

If I end up using this in Datasette it's going to need VERY comprehensive unit tests and inline documentation.
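That pattern can be exercised with Python's re module directly - a quick check, not the eventual Datasette routing code:

```python
import re

# The format-extraction pattern from the regex101 link above.
pattern = re.compile(
    r"^/(?P<database>[^/]+)/"
    r"(?P<table>(?:[^\/\-\.]*|(?:\-/)*|(?:\-\.)*|(?:\-\-)*)*?)"
    r"(?:(?<!\-)\.(?P<format>\w+))?$"
)

m = pattern.match("/db/table-/with-/slashes-.csv")
print(m.group("table"), m.group("format"))
# table-/with-/slashes-.csv None

m = pattern.match("/db/table-/with-/slashes-.csv.json")
print(m.group("table"), m.group("format"))
# table-/with-/slashes-.csv json
```

The `(?<!\-)` lookbehind is what stops the trailing `-.csv` of the encoded table name from being mistaken for a format extension.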


simonw commented Feb 18, 2022

And if for some horrific reason you had a table named /db/table-.csv.csv (so /db/ was part of the actual table name in SQLite), the URLs would look like this:

* `/db/%2Fdb%2Ftable---.csv-.csv` - the HTML version
* `/db/%2Fdb%2Ftable---.csv-.csv.csv` - the CSV version
* `/db/%2Fdb%2Ftable---.csv-.csv.json` - the JSON version

Here's what those look like with the updated version of dot_dash_encode() that also encodes / as -/:

  • /db/-/db-/table---.csv-.csv - HTML
  • /db/-/db-/table---.csv-.csv.csv - CSV
  • /db/-/db-/table---.csv-.csv.json - JSON



simonw commented Feb 18, 2022

Ugh, one disadvantage I just spotted with this: Datasette already has a /-/versions.json convention where "system" URLs are namespaced under /-/ - but that could be confused under this new scheme with the -/ escaping sequence.

And I've thought about adding /db/-/special and /db/table/-/special URLs in the past too.

Maybe change this system to use . as the escaping character instead of -?


simonw commented Feb 18, 2022

def dot_encode(s):
    return s.replace(".", "..").replace("/", "./")

def dot_decode(s):
    return s.replace("./", "/").replace("..", ".")

No need for hyphen encoding in this variant at all, which simplifies things a bit.

(Update: this is flawed, see #1439 (comment))


simonw commented Feb 18, 2022

>>> dot_encode("/db/table-.csv.csv")
'./db./table-..csv..csv'
>>> dot_decode('./db./table-..csv..csv')
'/db/table-.csv.csv'

I worry that web servers might treat ./ in a special way though.


simonw commented Feb 18, 2022

Do both of those survive the round-trip to populate raw_path correctly?

No! In both cases the /./ bit goes missing.

It looks like this might even be a client issue - curl shows me this:

~ % curl -vv -i 'https://datasette.io/-/asgi-scope/db/./db./table-..csv..csv'
*   Trying 216.239.32.21:443...
* Connected to datasette.io (216.239.32.21) port 443 (#0)
* ALPN, offering http/1.1
* TLS 1.2 connection using TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
* Server certificate: datasette.io
* Server certificate: R3
* Server certificate: ISRG Root X1
> GET /-/asgi-scope/db/db./table-..csv..csv HTTP/1.1

So curl decided to turn /-/asgi-scope/db/./db./table into /-/asgi-scope/db/db./table before even sending the request.


simonw commented Feb 18, 2022

Ugh, one disadvantage I just spotted with this: Datasette already has a /-/versions.json convention where "system" URLs are namespaced under /-/ - but that could be confused under this new scheme with the -/ escaping sequence.

And I've thought about adding /db/-/special and /db/table/-/special URLs in the past too.

I don't think this matters. The new regex does indeed capture that kind of page:

[regex101 screenshot showing the pattern matching /db/-/special]

But Datasette goes through configured route regular expressions in order - so I can have the regex that captures /db/-/special routes listed before the one that captures tables and formats.


simonw commented Feb 18, 2022

def dash_encode(s):
    return s.replace("-", "--").replace(".", "-.").replace("/", "-/")

def dash_decode(s):
    return s.replace("-/", "/").replace("-.", ".").replace("--", "-")

I think dash-encoding (new name for this) is the right way forward here.


simonw commented Feb 18, 2022

One other potential variant:

def dash_encode(s):
    return s.replace("-", "-dash-").replace(".", "-dot-").replace("/", "-slash-")

def dash_decode(s):
    return s.replace("-slash-", "/").replace("-dot-", ".").replace("-dash-", "-")

Except this has bugs - it doesn't round-trip safely, because the decoder can get confused by sequences like -dash-slash-: is that a -dash- or a -slash-?

>>> dash_encode("/db/table-.csv.csv")
'-slash-db-slash-table-dash--dot-csv-dot-csv'
>>> dash_decode('-slash-db-slash-table-dash--dot-csv-dot-csv')
'/db/table-.csv.csv'
>>> dash_encode('-slash-db-slash-table-dash--dot-csv-dot-csv')
'-dash-slash-dash-db-dash-slash-dash-table-dash-dash-dash--dash-dot-dash-csv-dash-dot-dash-csv'
>>> dash_decode('-dash-slash-dash-db-dash-slash-dash-table-dash-dash-dash--dash-dot-dash-csv-dash-dot-dash-csv')
'-dash/dash-db-dash/dash-table-dash--dash.dash-csv-dash.dash-csv'


simonw commented Feb 18, 2022

This made me worry that my current dash_decode() implementation had unknown round-trip bugs, but thankfully this works OK:

>>> dash_encode("/db/table-.csv.csv")
'-/db-/table---.csv-.csv'
>>> dash_encode('-/db-/table---.csv-.csv')
'---/db---/table-------.csv---.csv'
>>> dash_decode('---/db---/table-------.csv---.csv')
'-/db-/table---.csv-.csv'
>>> dash_decode('-/db-/table---.csv-.csv')
'/db/table-.csv.csv'

The regex still works against that double-encoded example too:

image


simonw commented Feb 18, 2022

Adopting this could result in supporting database files with surprising characters in their filename too.


simonw commented Feb 18, 2022

Should it encode % symbols too, since they have a special meaning in URLs and we can't guarantee that every single web server / proxy out there will round-trip them safely using percentage encoding? If so, would need to pick a different encoding character for them. Maybe % becomes -p - and in that case / could become -s too.

Is it worth expanding dash-encoding outside of just / and - and . though? Not sure.


simonw commented Feb 18, 2022

Here's a useful modern spec for how existing URL percentage encoding is supposed to work: https://url.spec.whatwg.org/#percent-encoded-bytes


simonw commented Feb 18, 2022

Note that I've ruled out using Accept: application/json to return JSON because it turns out Cloudflare and potentially other CDNs ignore the Vary: Accept header entirely.


simonw commented Mar 5, 2022

Lots of great conversations about the dash encoding implementation on Twitter: https://twitter.com/simonw/status/1500228316309061633

@dracos helped me figure out a simpler regex: https://twitter.com/dracos/status/1500236433809973248

^/(?P<database>[^/]+)/(?P<table>[^\/\-\.]*|\-/|\-\.|\-\-)*(?P<format>\.\w+)?$



simonw commented Mar 5, 2022

This comment from glyph got me thinking:

Have you considered replacing % with some other character and then using percent-encoding?

What happens if a table name includes a % character and that ends up getting mangled by a misbehaving proxy?

I should consider % in the escaping system too. And maybe go with that suggestion of using percent-encoding directly but with a different character.


simonw commented Mar 5, 2022

I want to try regular percentage encoding, except that it also encodes both the - and the . characters, AND it uses - instead of % as the encoding character.

Should check what it does with emoji too.


simonw commented Mar 5, 2022

OK, for that percentage thing: the Python core implementation of URL percentage escaping deliberately ignores two of the characters we want to escape: . and -:

https://github.com/python/cpython/blob/6927632492cbad86a250aa006c1847e03b03e70b/Lib/urllib/parse.py#L780-L783

_ALWAYS_SAFE = frozenset(b'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
                         b'abcdefghijklmnopqrstuvwxyz'
                         b'0123456789'
                         b'_.-~')

It also defaults to skipping / (passed as a safe= parameter to various things).

I'm going to try borrowing and modifying the core of the Python implementation: https://github.com/python/cpython/blob/6927632492cbad86a250aa006c1847e03b03e70b/Lib/urllib/parse.py#L795-L814

class _Quoter(dict):
    """A mapping from bytes numbers (in range(0,256)) to strings.
    String values are percent-encoded byte values, unless the key < 128, and
    in either of the specified safe set, or the always safe set.
    """
    # Keeps a cache internally, via __missing__, for efficiency (lookups
    # of cached keys don't call Python code at all).
    def __init__(self, safe):
        """safe: bytes object."""
        self.safe = _ALWAYS_SAFE.union(safe)

    def __repr__(self):
        return f"<Quoter {dict(self)!r}>"

    def __missing__(self, b):
        # Handle a cache miss. Store quoted string in cache and return.
        res = chr(b) if b in self.safe else '%{:02X}'.format(b)
        self[b] = res
        return res


simonw commented Mar 6, 2022

_ESCAPE_SAFE = frozenset(
    b'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    b'abcdefghijklmnopqrstuvwxyz'
    b'0123456789_'
)
# I removed b'.-~')

class Quoter(dict):
    # Keeps a cache internally, via __missing__
    def __missing__(self, b):
        # Handle a cache miss. Store quoted string in cache and return.
        res = chr(b) if b in _ESCAPE_SAFE else '-{:02X}'.format(b)
        self[b] = res
        return res

quoter = Quoter().__getitem__

''.join([quoter(char) for char in b'foo/bar.csv'])
# 'foo-2Fbar-2Ecsv'
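That also answers the earlier question about emoji: each UTF-8 byte gets dash-escaped individually. A standalone version of the same logic:

```python
_ESCAPE_SAFE = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789_"
)

def dash_quote(s):
    # Encode the UTF-8 bytes, dash-escaping anything outside the safe set.
    return "".join(
        chr(b) if b in _ESCAPE_SAFE else "-{:02X}".format(b)
        for b in s.encode("utf-8")
    )

print(dash_quote("🐶.csv"))  # -F0-9F-90-B6-2Ecsv
```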


karlcow commented Mar 6, 2022

Probably too late… but I have just seen this via
http://simonwillison.net/2022/Mar/5/dash-encoding/#atom-everything

And it reminded me of comma tools at W3C.
http://www.w3.org/,tools

Example, the text version of W3C homepage
https://www.w3.org/,text

The challenge comes down to telling the difference between the following:

* `/db/table` - an HTML table page → `/db/table`
* `/db/table.csv` - the CSV version of `/db/table` → `/db/table,csv`
* `/db/table.csv` - actually a database table called `table.csv` → `/db/table.csv`
* `/db/table.csv.csv` - the CSV version of `/db/table.csv` → `/db/table.csv,csv`
* `/db/table.csv.csv.csv` and so on... → `/db/table.csv.csv,csv`

I haven't checked all the cases in the thread.


simonw commented Mar 6, 2022

Needs more testing, but this seems to work for decoding the percent-escaped-with-dashes format: urllib.parse.unquote(s.replace('-', '%'))
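Wrapped up as a function, with a check against the encoder output from the previous comment:

```python
import urllib.parse

def dash_decode(s):
    # "-" plays the role of "%" in dash encoding, so swap it back
    # and let the standard percent-decoder do the work.
    return urllib.parse.unquote(s.replace("-", "%"))

print(dash_decode("foo-2Fbar-2Ecsv"))  # foo/bar.csv
```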


simonw commented Mar 6, 2022

Suggestion from a conversation with Seth Michael Larson: it would be neat if plugins could easily integrate with whatever scheme this ends up using, maybe with the /db/table/-/plugin-name standardized pattern or similar.

Making it easy for plugins to do the right, consistent thing is a good idea.


simonw commented Mar 6, 2022

Test:

@pytest.mark.parametrize(
    "original,expected",
    (
        ("abc", "abc"),
        ("/foo/bar", "-2Ffoo-2Fbar"),
        ("/-/bar", "-2F-2D-2Fbar"),
        ("-/db-/table.csv", "-2D-2Fdb-2D-2Ftable-2Ecsv"),
        (r"%~-/", "-25-7E-2D-2F"),
        ("-25-7E-2D-2F", "-2D25-2D7E-2D2D-2D2F"),
    ),
)
def test_dash_encoding(original, expected):
    actual = utils.dash_encode(original)
    assert actual == expected
    # And test round-trip
    assert original == utils.dash_decode(actual)

One big advantage to this scheme is that redirecting old links to %2F pages (e.g. https://fivethirtyeight.datasettes.com/fivethirtyeight/twitter-ratio%2Fsenators) is easy - if you see a % in the raw_path, redirect to that page with the % replaced by -.
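A hedged sketch of that redirect logic - the function name and shape here are assumptions, not Datasette's actual implementation:

```python
def old_percent_path_redirect(raw_path):
    # Old-style links percent-encoded special characters; the new
    # dash-encoded equivalent just uses "-" where "%" used to be.
    if "%" in raw_path:
        return raw_path.replace("%", "-")
    return None  # no redirect needed

print(old_percent_path_redirect("/fivethirtyeight/twitter-ratio%2Fsenators"))
# /fivethirtyeight/twitter-ratio-2Fsenators
```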

simonw added a commit that referenced this issue Mar 7, 2022
* Dash encoding functions, tests and docs, refs #1439
* dash encoding is now like percent encoding but with dashes
* Use dash-encoding for row PKs and ?_next=, refs #1439
* Use dash encoding for table names, refs #1439
* Use dash encoding for database names, too, refs #1439

See also https://simonwillison.net/2022/Mar/5/dash-encoding/

simonw commented Mar 7, 2022

I didn't need to do any of the fancy regular expression routing stuff after all, since the new dash encoding format avoids using / so a simple [^/]+ can capture the correct segments from the URL.
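A simplified illustration of that routing - the actual route regexes in Datasette differ:

```python
import re

# Dash-encoded names contain no "/" or ".", so plain [^/]+ segments plus an
# optional trailing .format extension are unambiguous.
route = re.compile(
    r"^/(?P<database>[^/]+)/(?P<table>[^/]+?)(?:\.(?P<format>\w+))?$"
)

m = route.match("/db/table-2Ecsv.json")
print(m.group("database"), m.group("table"), m.group("format"))
# db table-2Ecsv json
```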


simonw commented Mar 13, 2022

OK, this has broken a lot more than I expected it would.

Turns out - is a very common character in existing Datasette database names!

https://datasette.io/-/databases for example has two:

[
    {
        "name": "docs-index",
        "path": "docs-index.db",
        "size": 1007616,
        "is_mutable": false,
        "is_memory": false,
        "hash": "0ac6c3de2762fcd174fd249fed8a8fa6046ea345173d22c2766186bf336462b2"
    },
    {
        "name": "dogsheep-index",
        "path": "dogsheep-index.db",
        "size": 5496832,
        "is_mutable": false,
        "is_memory": false,
        "hash": "d1ea238d204e5b9ae783c86e4af5bcdf21267c1f391de3e468d9665494ee012a"
    }
]

simonw added a commit to simonw/datasette.io that referenced this issue Mar 13, 2022
simonw added a commit to simonw/covid-19-datasette that referenced this issue Mar 13, 2022

simonw commented Mar 13, 2022

If I want to reserve - as a character that CAN be used in URLs, the only remaining character that might make sense for escape sequences is ~ - based on this last line of characters that are exempt from percent-encoding:

_ALWAYS_SAFE = frozenset(b'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
                         b'abcdefghijklmnopqrstuvwxyz'
                         b'0123456789'
                         b'_.-~')

So I'd add both - and _ back to the safe list, but use ~ to escape . and / and suchlike.
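A sketch of what that tilde variant could look like - the encoding that eventually landed in #1657 may differ in its exact safe set:

```python
import urllib.parse

# Assumed safe set: alphanumerics plus "_" and "-"; "." and "/" get escaped.
_TILDE_SAFE = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789_-"
)

def tilde_encode(s):
    # Percent-encoding, but with "~" as the escape character.
    return "".join(
        chr(b) if b in _TILDE_SAFE else "~{:02X}".format(b)
        for b in s.encode("utf-8")
    )

def tilde_decode(s):
    # Protect any literal "%", then treat "~" as "%" and percent-decode.
    return urllib.parse.unquote(s.replace("%", "%25").replace("~", "%"))

print(tilde_encode("foo/bar.csv"))      # foo~2Fbar~2Ecsv
print(tilde_decode("foo~2Fbar~2Ecsv"))  # foo/bar.csv
```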


simonw commented Mar 15, 2022

I'm happy with this now that I've landed Tilde encoding in #1657.

@simonw simonw closed this as completed Mar 15, 2022