Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect URLs when served behind a proxy with base_url set #838

Closed
tsibley opened this issue Jun 11, 2020 · 14 comments
Closed

Incorrect URLs when served behind a proxy with base_url set #838

tsibley opened this issue Jun 11, 2020 · 14 comments
Labels
Milestone

Comments

@tsibley
Copy link

tsibley commented Jun 11, 2020

I'm running datasette serve --config base_url:/foo/ …, proxying to it with this Apache config:

    ProxyPass /foo/ http://localhost:8001/ 
    ProxyPassReverse /foo/ http://localhost:8001/ 

and then accessing it via https://example.com/foo/.

Although many of the URLs in the pages are correct (presumably because they either use absolute paths which include base_url or relative paths), the faceting and pagination links still use fully-qualified URLs pointing at http://localhost:8001.

I looked into this a little in the source code, and it seems to be an issue anywhere request.url or request.path is used, as these contain the values for the request between the frontend (Apache) and backend (Datasette) server. Those properties are primarily used via the path_with_… family of utility functions and the Datasette.absolute_url method.

@tsibley tsibley changed the title Incorrect URLs when served behind a proxy and base_url set Incorrect URLs when served behind a proxy with base_url set Jun 11, 2020
@simonw simonw added the bug label Jun 12, 2020
@simonw
Copy link
Owner

simonw commented Jun 12, 2020

Have you tried this without the ProxyPassReverse directive? I'm worried that might be confusing Datasette.

This is the test I used to ensure this feature works - it scrapes all of the links on a bunch of different pages. Could it be missing something here?

datasette/tests/test_html.py

Lines 1233 to 1274 in 647c5ff

@pytest.mark.parametrize("base_url", ["/prefix/", "https://example.com/"])
@pytest.mark.parametrize(
"path",
[
"/",
"/fixtures",
"/fixtures/compound_three_primary_keys",
"/fixtures/compound_three_primary_keys/a,a,a",
"/fixtures/paginated_view",
"/fixtures/facetable",
],
)
def test_base_url_config(base_url, path):
with make_app_client(config={"base_url": base_url}) as client:
response = client.get(base_url + path.lstrip("/"))
soup = Soup(response.body, "html.parser")
for el in soup.findAll(["a", "link", "script"]):
if "href" in el.attrs:
href = el["href"]
elif "src" in el.attrs:
href = el["src"]
else:
continue # Could be a <script>...</script>
if (
not href.startswith("#")
and href
not in {
"https://github.com/simonw/datasette",
"https://github.com/simonw/datasette/blob/master/LICENSE",
"https://github.com/simonw/datasette/blob/master/tests/fixtures.py",
}
and not href.startswith("https://plugin-example.com/")
):
# If this has been made absolute it may start http://localhost/
if href.startswith("http://localhost/"):
href = href[len("http://localost/") :]
assert href.startswith(base_url), {
"base_url": base_url,
"path": path,
"href_or_src": href,
"element_parent": str(el.parent),
}

@tsibley
Copy link
Author

tsibley commented Jun 12, 2020

Hmm, I haven't tried removing ProxyPassReverse, but it doesn't touch the HTML, which is the issue I'm seeing. You can read the documentation here. ProxyPassReverse is a standard directive when proxying with Apache. I've used it dozens of times with other applications.

Looking a little more at the code, I think the issue here is that the behaviour of base_url makes sense when Datasette is mounted at a path within a larger application, but not when HTTP requests are being proxied to it.

In a mount situation, it is perfectly fine to construct URLs reusing the domain and path from the request. In a proxy situation, it never is, as the domain and path in the request are not the domain and path that the non-proxy client actually needs to use. That is, links which include the Apache → Datasette request origin, localhost:8001, instead of the browser → Apache request origin, example.com, will be broken.

The tests you pointed to also reflect this in two ways:

  1. They strip a leading http://localhost, allowing such URLs in the facet links to pass, but inclusion of that in a proxy situation would mean the URL is broken.

  2. The test client emits direct ASGI events instead of actual proxied HTTP requests. The headers of these ASGI events don't reflect the way an HTTP proxy works; instead they pass through the original request path which contains base_url. This works because Datasette responds to requests equivalently at either /… or /{base_url}/…, which makes some sense in a mount situation but is unconventional (albeit workable) for a proxied app.

Apps that support being proxied automatically support being mounted, but apps that only support being mounted don't automatically support being proxied.

@ChristopherWilks
Copy link

I also am seeing the same issue with an Apache setup (same even w/o ProxyPassReverse, though I typically use it as @tsibley stated).

But also want to say thanks for a great tool (this issue not withstanding)!

@tballison
Copy link

But also want to say thanks for a great tool

+1!

@simonw
Copy link
Owner

simonw commented Oct 15, 2020

Tracking ticket: #1023

@simonw
Copy link
Owner

simonw commented Oct 20, 2020

OK, I've made a ton of improvements to how the base_url setting works - see tickets linked from #1023. I've just pushed out an alpha release with those changes in it: https://github.com/simonw/datasette/releases/tag/0.51a0

@tsibley @tballison @ChristopherWilks I'd really appreciate your help testing this alpha!

You can install it with:

pip install datasette==0.51a0

It should work with just ProxyPass, without needing the ProxyPassReverse setting.

@simonw simonw added this to the 0.51 milestone Oct 23, 2020
@psychemedia
Copy link
Contributor

psychemedia commented Oct 25, 2020

I'm trying to run something behind a MyBinder proxy, but seem to have something set up incorrectly and not sure what the fix is?

I'm starting datasette with jupyter-server-proxy setup:

# __init__.py
def setup_nbsearch():

    return {
        "command": [
            "datasette",
            "serve",
            f"{_NBSEARCH_DB_PATH}",
            "-p",
            "{port}",
            "--config",
            "base_url:{base_url}nbsearch/"
        ],
        "absolute_url": True,
        # The following needs a the labextension installing.
        # eg in postBuild: jupyter labextension install jupyterlab-server-proxy
        "launcher_entry": {
            "enabled": True,
            "title": "nbsearch",
        },
    }

where the base_url gets automatically populated by the server-proxy. I define the loaders as:

# __init__.py
from datasette import hookimpl

@hookimpl
def extra_css_urls(database, table, columns, view_name, datasette):
    return [
        "/-/static-plugins/nbsearch/prism.css",
        "/-/static-plugins/nbsearch/nbsearch.css",
    ]

but these seem to also need a base_url prefix set somehow?

Currently, the generated HTML loads properly but internal links are incorrect; eg they take the form <link rel="stylesheet" href="/-/static-plugins/nbsearch/prism.css"> which resolves to eg https://notebooks.gesis.org/hub/-/static-plugins/nbsearch/prism.css rather than required URL of form https://notebooks.gesis.org/binder/jupyter/user/ouseful-testing-nbsearch-0fx1mx67/nbsearch/-/static-plugins/nbsearch/prism.css.

The main css is loaded correctly: <link rel="stylesheet" href="/binder/jupyter/user/ouseful-testing-nbsearch-0fx1mx67/nbsearch/-/static/app.css?404439">

@simonw
Copy link
Owner

simonw commented Oct 31, 2020

OK, this should be working now. You can use the datasette.urls.static_plugins() method to generate the correct URLs in the extra_css_urls plugin hook: https://docs.datasette.io/en/latest/internals.html#datasette-urls

@simonw simonw closed this as completed Oct 31, 2020
@psychemedia
Copy link
Contributor

Thanks; just a note that the datasette.urls.static(path) and datasette.urls.static_plugins(plugin_name, path) items both seem to be repeated and appear in the docs twice?

tsibley added a commit to seattleflu/switchboard that referenced this issue Mar 4, 2021
Unpins Datasette and upgrades to the latest version.  Adjusts custom
page styles and command-line options for upstream changes.  Removes the
link-fixing JS hack now that base_url is properly supported.¹

There is still one bug related to base_url² which affects custom pages
like our barcode dialer.  I've worked around this by symlinking
dial.html under the expected production base_url.  When the bug is
resolved, the symlink can be removed.

¹ simonw/datasette#838
² simonw/datasette#1238
@tsibley
Copy link
Author

tsibley commented Mar 10, 2021

@simonw Unfortunately this issue as I reported it is not actually solved in version 0.55.

Every link which is returned by the Datasette.absolute_url method is still wrong, because it uses the request URL as the base. This still includes the suggested facet links and pagination links.

What I wrote originally still stands:

Although many of the URLs in the pages are correct (presumably because they either use absolute paths which include base_url or relative paths), the faceting and pagination links still use fully-qualified URLs pointing at http://localhost:8001.

I looked into this a little in the source code, and it seems to be an issue anywhere request.url or request.path is used, as these contain the values for the request between the frontend (Apache) and backend (Datasette) server. Those properties are primarily used via the path_with_… family of utility functions and the Datasette.absolute_url method.

Would you prefer to re-open this issue or have me create a new one?

@simonw
Copy link
Owner

simonw commented Mar 10, 2021

Let's reopen this.

@simonw simonw reopened this Mar 10, 2021
@simonw
Copy link
Owner

simonw commented Mar 10, 2021

The biggest challenge here I think is to replicate the exact situation here this happens in a Python unit test. The fix should be easy once we have a test in place.

tsibley added a commit to seattleflu/switchboard that referenced this issue Mar 10, 2021
Restores to the previous state before I removed it a little
too-optimistically in "Upgrade Datasette" (ee654fb).  Small change to
come in a separate commit for visibility.

I thought the bad faceting and pagination links¹ were fixed with
Datasette 0.55, but after further testing, it turns out they're not.²

¹ simonw/datasette#838
² simonw/datasette#838 (comment)
@tsibley
Copy link
Author

tsibley commented Mar 10, 2021

Nod. The problem with the tests is that they're ignoring the origin (hostname, port) of links. In a reverse proxy situation, the frontend request origin is different than the backend request origin. The problem is Datasette generates links with the backend request origin.

@tsibley
Copy link
Author

tsibley commented Mar 10, 2021

I think this could be solved by one of:

  1. Stop generating absolute URLs, e.g. ones that include an origin. Relative URLs with absolute paths are fine, as long as they take base_url into account (as they do now, yay!).
  2. Extend base_url to include the expected frontend origin, and then use that information when generating absolute URLs.
  3. Document which HTTP headers the reverse proxy should set (e.g. the X-Forwarded-* family of conventional headers) to pass the frontend origin information to Datasette, and then use that information when generating absolute URLs.

Option 1 seems like the easiest to me, if you can get away with never having to generate an absolute URL.

@simonw simonw closed this as completed Nov 20, 2021
simonw added a commit that referenced this issue Nov 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants