Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

base_url is omitted in JSON and CSV views #1519

Closed
phubbard opened this issue Nov 19, 2021 · 22 comments
Closed

base_url is omitted in JSON and CSV views #1519

phubbard opened this issue Nov 19, 2021 · 22 comments
Labels

Comments

@phubbard
Copy link

phubbard commented Nov 19, 2021

I have a datasette deployment, using Apache2 to reverse proxy:

   ProxyPass /ged http://thor.phfactor.net:8001
   ProxyPreserveHost On

In settings.json I have

{
    "base_url": "/ged/",
    "trace_debug": 1,
    "template_debug": 1
}

and datasette works correctly. However, if you view a query and then click on the 'This data as json, CSV' both links omit the base_url prefix and are therefore 404.

@simonw simonw added the bug label Nov 19, 2021
@simonw
Copy link
Owner

simonw commented Nov 19, 2021

base_url has been a source of so many bugs like this! I often find them quite hard to replicate, likely because I haven't made myself a good Apache mod_proxy testing environment yet.

@simonw
Copy link
Owner

simonw commented Nov 19, 2021

Having a live demo running on Cloud Run that proxies through Apache and uses base_url would be incredibly useful for replicating and debugging this kind of thing. I wonder how hard it is to run Apache and mod_proxy in the same Docker container as Datasette?

@simonw
Copy link
Owner

simonw commented Nov 19, 2021

I now have a Dockerfile in #1521 (comment) that I can use to run a local Apache 2 with mod_proxy to investigate this class of bugs!

@simonw simonw reopened this Nov 19, 2021
@simonw
Copy link
Owner

simonw commented Nov 19, 2021

Bug confirmed:

proxy-bug

@simonw
Copy link
Owner

simonw commented Nov 19, 2021

The relevant test is this one:

datasette/tests/test_html.py

Lines 1608 to 1649 in 3025505

@pytest.mark.parametrize(
"path",
[
"/",
"/fixtures",
"/fixtures/compound_three_primary_keys",
"/fixtures/compound_three_primary_keys/a,a,a",
"/fixtures/paginated_view",
"/fixtures/facetable",
],
)
def test_base_url_config(app_client_base_url_prefix, path):
client = app_client_base_url_prefix
response = client.get("/prefix/" + path.lstrip("/"))
soup = Soup(response.body, "html.parser")
for el in soup.findAll(["a", "link", "script"]):
if "href" in el.attrs:
href = el["href"]
elif "src" in el.attrs:
href = el["src"]
else:
continue # Could be a <script>...</script>
if (
not href.startswith("#")
and href
not in {
"https://datasette.io/",
"https://github.com/simonw/datasette",
"https://github.com/simonw/datasette/blob/main/LICENSE",
"https://github.com/simonw/datasette/blob/main/tests/fixtures.py",
"/login-as-root", # Only used for the latest.datasette.io demo
}
and not href.startswith("https://plugin-example.datasette.io/")
):
# If this has been made absolute it may start http://localhost/
if href.startswith("http://localhost/"):
href = href[len("http://localost/") :]
assert href.startswith("/prefix/"), {
"path": path,
"href_or_src": href,
"element_parent": str(el.parent),
}

I modified that test to add "/fixtures/facetable?sql=select+1" as one of the tested paths, and dropped in an assert False to pause it in the debugger:

    @pytest.mark.parametrize(
        "path",
        [
            "/",
            "/fixtures",
            "/fixtures/compound_three_primary_keys",
            "/fixtures/compound_three_primary_keys/a,a,a",
            "/fixtures/paginated_view",
            "/fixtures/facetable",
            "/fixtures?sql=select+1",
        ],
    )
    def test_base_url_config(app_client_base_url_prefix, path):
        client = app_client_base_url_prefix
        response = client.get("/prefix/" + path.lstrip("/"))
        soup = Soup(response.body, "html.parser")
        if path == "/fixtures?sql=select+1":
>           assert False
E           assert False

BUT... in the debugger:

(Pdb) print(soup)
...
<p class="export-links">This data as
  <a href="/prefix/fixtures.json?sql=select+1">json</a>,
  <a href="/prefix/fixtures.testall?sql=select+1">testall</a>,
  <a href="/prefix/fixtures.testnone?sql=select+1">testnone</a>,
  <a href="/prefix/fixtures.testresponse?sql=select+1">testresponse</a>,
  <a href="/prefix/fixtures.csv?sql=select+1&amp;_size=max">CSV</a></p>

Those all have the correct prefix! But that's not what I'm seeing in my Dockerfile reproduction of the issue.

Something very weird is going on here.

@simonw
Copy link
Owner

simonw commented Nov 19, 2021

I added template_debug in the Dockerfile:

datasette fixtures.db --setting template_debug 1 --setting base_url "/foo/bar/" -p 9000 &\n\

And then hit http://localhost:5000/foo/bar/fixtures?sql=select+*+from+compound_three_primary_keys+limit+1&_context=1 to view the template context - and it showed the bug, output edited to just show relevant keys:

{
    "edit_sql_url": "/foo/bar/fixtures?sql=select+%2A+from+compound_three_primary_keys+limit+1",
    "settings": {
        "force_https_urls": false,
        "template_debug": true,
        "trace_debug": false,
        "base_url": "/foo/bar/"
    },
    "show_hide_link": "/fixtures?sql=select+%2A+from+compound_three_primary_keys+limit+1&_context=1&_hide_sql=1",
    "show_hide_text": "hide",
    "show_hide_hidden": "",
    "renderers": {
        "json": "/fixtures.json?sql=select+*+from+compound_three_primary_keys+limit+1&_context=1"
    },
    "url_csv": "/fixtures.csv?sql=select+*+from+compound_three_primary_keys+limit+1&_context=1&_size=max",
    "url_csv_path": "/fixtures.csv",
    "base_url": "/foo/bar/"
}

This is so strange. edit_sql_url and base_url are correct, but show_hide_link and url_csv and renderers.json are not.

And it's really strange that the bug doesn't show up in the tests.

@simonw
Copy link
Owner

simonw commented Nov 19, 2021

Here's the code that generates edit_sql_url correctly:

if allow_execute_sql and is_validated_sql and ":_" not in sql:
edit_sql_url = (
self.ds.urls.database(database)
+ "?"
+ urlencode(

And here's the code for show_hide_link:

if bool(params.get("_show_sql")):
show_hide_link = path_with_removed_args(request, {"_show_sql"})

And for url_csv:

url_csv = path_with_format(
request=request, format="csv", extra_qs=url_csv_args
)

@simonw
Copy link
Owner

simonw commented Nov 19, 2021

The implementations of path_with_removed_args and path_with_format:

def path_with_removed_args(request, args, path=None):
query_string = request.query_string
if path is None:
path = request.path
else:
if "?" in path:
bits = path.split("?", 1)
path, query_string = bits
# args can be a dict or a set
current = []
if isinstance(args, set):
def should_remove(key, value):
return key in args
elif isinstance(args, dict):
# Must match key AND value
def should_remove(key, value):
return args.get(key) == value
for key, value in urllib.parse.parse_qsl(query_string):
if not should_remove(key, value):
current.append((key, value))
query_string = urllib.parse.urlencode(current)
if query_string:
query_string = f"?{query_string}"
return path + query_string

def path_with_format(
*, request=None, path=None, format=None, extra_qs=None, replace_format=None
):
qs = extra_qs or {}
path = request.path if request else path
if replace_format and path.endswith(f".{replace_format}"):
path = path[: -(1 + len(replace_format))]
if "." in path:
qs["_format"] = format
else:
path = f"{path}.{format}"
if qs:
extra = urllib.parse.urlencode(sorted(qs.items()))
if request and request.query_string:
path = f"{path}?{request.query_string}&{extra}"
else:
path = f"{path}?{extra}"
elif request and request.query_string:
path = f"{path}?{request.query_string}"
return path

@simonw
Copy link
Owner

simonw commented Nov 19, 2021

In the ?_context= debug view the request looks like this:

    "request": "<datasette.utils.asgi.Request object at 0x7faf9fe06200>",

I'm going to add a repr() to it such that it's a bit more useful.

simonw added a commit that referenced this issue Nov 19, 2021
@simonw
Copy link
Owner

simonw commented Nov 19, 2021

Modified my Dockerfile to do this:

RUN pip install https://github.com/simonw/datasette/archive/ff0dd4da38d48c2fa9250ecf336002c9ed724e36.zip

And now the request in that debug ?_context=1 looks like this:

    "request": "<asgi.Request method=\"GET\" url=\"http://localhost:9000/fixtures?sql=select+*+from+compound_three_primary_keys+limit+1&_context=1\">"

That explains the bug - that request doesn't maintain the original path prefix of http://localhost:5000/foo/bar/fixtures?sql= (also it's been rewritten to localhost:9000 instead of localhost:5000).

@simonw
Copy link
Owner

simonw commented Nov 19, 2021

Still not clear why the tests pass but the live example fails.

@simonw
Copy link
Owner

simonw commented Nov 19, 2021

Figured it out! The test is not an accurate recreation of what is happening, because it doesn't simulate a request with a path of /fixtures that has been redirected by the proxy to /prefix/fixtures.

@simonw
Copy link
Owner

simonw commented Nov 19, 2021

https://docs.datasette.io/en/stable/deploying.html#apache-proxy-configuration says I should use ProxyPreserveHost on.

@simonw
Copy link
Owner

simonw commented Nov 19, 2021

I think what's happening here is Apache is actually making a request to /fixtures rather than making a request to /prefix/fixtures - and Datasette is replying to requests on both the prefixed and the non-prefixed paths.

This is pretty confusing! I think Datasette should ONLY reply to /prefix/fixtures instead and return a 404 for /fixtures - this would make things a whole lot easier to debug.

But shipping that change could break existing deployments. Maybe that should be a breaking change for 1.0.

@simonw
Copy link
Owner

simonw commented Nov 19, 2021

In the meantime I can catch these errors by changing the test to run each path twice, once with and once without the prefix. This should accurately simulate how Apache is working here.

@simonw
Copy link
Owner

simonw commented Nov 20, 2021

Thanks to #1522 I have a live demo that exhibits this bug now: https://apache-proxy-demo.datasette.io/prefix/fixtures/attraction_characteristic

@simonw
Copy link
Owner

simonw commented Nov 20, 2021

In the meantime I can catch these errors by changing the test to run each path twice, once with and once without the prefix. This should accurately simulate how Apache is working here.

This worked, I managed to get the tests to fail! Here's the change I made:

diff --git a/tests/test_html.py b/tests/test_html.py
index f24165b..dbdfe59 100644
--- a/tests/test_html.py
+++ b/tests/test_html.py
@@ -1614,12 +1614,19 @@ def test_metadata_sort_desc(app_client):
         "/fixtures/compound_three_primary_keys/a,a,a",
         "/fixtures/paginated_view",
         "/fixtures/facetable",
+        "/fixtures?sql=select+1",
     ],
 )
-def test_base_url_config(app_client_base_url_prefix, path):
+@pytest.mark.parametrize("use_prefix", (True, False))
+def test_base_url_config(app_client_base_url_prefix, path, use_prefix):
     client = app_client_base_url_prefix
-    response = client.get("/prefix/" + path.lstrip("/"))
+    path_to_get = path
+    if use_prefix:
+        path_to_get = "/prefix/" + path.lstrip("/")
+    response = client.get(path_to_get)
     soup = Soup(response.body, "html.parser")
+    if path == "/fixtures?sql=select+1":
+        assert False
     for el in soup.findAll(["a", "link", "script"]):
         if "href" in el.attrs:
             href = el["href"]
@@ -1642,11 +1649,12 @@ def test_base_url_config(app_client_base_url_prefix, path):
             # If this has been made absolute it may start http://localhost/
             if href.startswith("http://localhost/"):
                 href = href[len("http://localost/") :]
-            assert href.startswith("/prefix/"), {
+            assert href.startswith("/prefix/"), json.dumps({
                 "path": path,
+                "path_to_get":  path_to_get,
                 "href_or_src": href,
                 "element_parent": str(el.parent),
-            }
+            }, indent=4, default=repr)
 
 
 def test_base_url_affects_metadata_extra_css_urls(app_client_base_url_prefix):

@simonw
Copy link
Owner

simonw commented Nov 20, 2021

Adding that test found (I hope!) all of the remaining base_url bugs. There were a bunch! I think I finally get to close #838 too.

simonw added a commit that referenced this issue Nov 20, 2021
@simonw
Copy link
Owner

simonw commented Nov 20, 2021

Ouch a nasty bug crept through there - https://datasette-apache-proxy-demo-j7hipcg4aq-uc.a.run.app/prefix/fixtures/compound_three_primary_keys says

500: name 'ds' is not defined

@simonw
Copy link
Owner

simonw commented Nov 20, 2021

OK, i think I got all of them this time!

The latest demo is now live at https://datasette-apache-proxy-demo.fly.dev/prefix/fixtures/sortable?_facet=pk2

I'm closing this issue, but feel free to re-open it if you spot any that I missed.

@simonw simonw closed this as completed Nov 20, 2021
@simonw
Copy link
Owner

simonw commented Nov 20, 2021

I think what's happening here is Apache is actually making a request to /fixtures rather than making a request to /prefix/fixtures - and Datasette is replying to requests on both the prefixed and the non-prefixed paths.

This is pretty confusing! I think Datasette should ONLY reply to /prefix/fixtures instead and return a 404 for /fixtures - this would make things a whole lot easier to debug.

But shipping that change could break existing deployments. Maybe that should be a breaking change for 1.0.

On further thought I'm not going to do this. Having Datasette work behind a proxy the way it does right now is clearly easy for people to deploy (now that I've fixed the bugs) and I trust my improved tests to catch problems in the future.

@phubbard
Copy link
Author

phubbard commented Dec 1, 2021

thanks so very much for the prompt attention and fix! Plus, the animated GIF showing the bug is just extra and I love it. Interactions like this are why I love open source.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants