
base_url configuration setting #394

Closed
simonw opened this issue Jan 5, 2019 · 27 comments
@simonw
Owner

simonw commented Jan 5, 2019

I've identified a couple of use-cases for running Datasette in a way that overrides the default way that internal URLs are generated.

  1. Running behind a reverse proxy. I tried running Datasette behind a proxy and found that some of the generated internal links incorrectly referenced http://127.0.0.1:8001/fixtures/... when they should have referenced http://my-host.my-domain.com/fixtures/... This is a problem both for links within the HTML interface and for the toggle_url keys returned in the JSON as part of the facets data structure.
  2. I would like it to be possible to host a Datasette instance at e.g. https://www.mynewspaper.com/interactives/2018/election-results/ - either through careful HTTP proxying or, once Datasette has been ported to ASGI, by mounting a Datasette ASGI instance deep within an existing set of URL routes.

I'm going to add a url_prefix configuration option. This will default to "", which means Datasette will behave as it does at the moment - it will use / for most URL prefixes in the HTML version, and an absolute URL derived from the incoming Host header for URLs that are returned as part of the JSON output.

If url_prefix is set to another value (either a full URL or a path) then that prefix will be prepended to all generated URLs.
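The intended behaviour can be sketched with a tiny helper (my illustration of the idea, not the eventual implementation; `prefixed_url` is a hypothetical name):

```python
def prefixed_url(url_prefix, path):
    """Join a configured url_prefix (a path or a full URL) with an
    application path to produce the link that should be emitted."""
    if not url_prefix:
        # Empty prefix: current behaviour, paths are emitted as-is
        return path
    return url_prefix.rstrip("/") + "/" + path.lstrip("/")

print(prefixed_url("", "/fixtures"))                      # /fixtures
print(prefixed_url("/prefix/", "/fixtures"))              # /prefix/fixtures
print(prefixed_url("https://example.com/", "/fixtures"))  # https://example.com/fixtures
```

The same joining logic covers both use-cases above: a bare path for mounting under a sub-path, and a full URL for the reverse-proxy case.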

@simonw
Owner Author

simonw commented Jan 6, 2019

I found a really nice pattern for writing the unit tests for this (though it would look even nicer with a solution to #395)

@pytest.mark.parametrize("prefix", ["/prefix/", "https://example.com/"])
@pytest.mark.parametrize("path", [
    "/",
    "/fixtures",
    "/fixtures/compound_three_primary_keys",
    "/fixtures/compound_three_primary_keys/a,a,a",
    "/fixtures/paginated_view",
])
def test_url_prefix_config(prefix, path):
    for client in make_app_client(config={
        "url_prefix": prefix,
    }):
        response = client.get(path)
        soup = Soup(response.body, "html.parser")
        for a in soup.findAll("a"):
            href = a["href"]
            if href not in {
                "https://github.com/simonw/datasette",
                "https://github.com/simonw/datasette/blob/master/LICENSE",
                "https://github.com/simonw/datasette/blob/master/tests/fixtures.py",
            }:
                assert href.startswith(prefix), (href, a.parent)

@kevindkeogh
Contributor

Hey, was this ever merged? I'm trying to run this behind nginx, and encountering this issue.

@kevindkeogh
Contributor

kevindkeogh commented Jun 7, 2019

Putting this here in case anyone else encounters the same issue with nginx, I was able to resolve it by passing the header in the nginx proxy config (i.e., proxy_set_header Host $host).

@jsfenfen
Contributor

jsfenfen commented Nov 21, 2019

Hey @simonw, is the url_prefix config option available in another branch? It looks like you've written some tests for it above. In 0.32 I get "url_prefix is not a valid option". I think this would be really helpful!

This would be really handy for proxying Datasette in another domain's subdirectory. I believe this would allow folks to run upstream authentication, but the links break if the url_prefix doesn't match.

I'd prefer not to host a proxied version of Datasette on a subdomain (e.g. datasette.myurl.com) because then I'd have to worry about sharing authorization cookies with the subdomain, which I'd just as soon not do, but...

Edit: I see the wip-url-prefix branch, I may try with that 8da2db4

@terrycojones

Agreed, this would be nice to have. I'm currently working around it in nginx with additional location blocks:


    location /datasette/ {
        proxy_pass         http://127.0.0.1:8001/;
        proxy_redirect     off;
        include proxy_params;
    }

    location /dna-protein-genome/ {
        proxy_pass         http://127.0.0.1:8001/dna-protein-genome/;
        proxy_redirect     off;
        include proxy_params;
    }

    location /rna-protein-genome/ {
        proxy_pass         http://127.0.0.1:8001/rna-protein-genome/;
        proxy_redirect     off;
        include proxy_params;
    }

The 2nd and 3rd locations above are my databases. This works, but I have a small problem with URLs like /rna-protein-genome?params... that I could fix with some more nginx munging. I seem to do this sort of thing once every 5 years and then have to look it all up again.

Thanks!

@terrycojones

Hmmm, wait, maybe my mindless (copy/paste) use of proxy_redirect is causing me grief...

@jsfenfen
Contributor

FWIW I did a dumb merge of the branch here: https://github.com/jsfenfen/datasette and it seemed to work, in that I could run stuff at a subdirectory, but I ended up abandoning it in favor of just using a subdomain because getting the nginx configs right was making me crazy. I'd still prefer hosting at a subdirectory, but the subdomain seems simpler at the moment.

@terrycojones

@simonw What about allowing a base URL? The <base> tag has been around forever. Then just use relative URLs everywhere, which I guess is likely what you already do. See https://www.w3schools.com/TAGs/tag_base.asp

@betatim

betatim commented Mar 23, 2020

On mybinder.org we allow access to arbitrary processes listening on a port inside the container via a reverse proxy.

This means we need support for a proxy prefix as the proxy ends up running at a URL like /something/random/proxy/datasette/...

An example that shows the problem is https://github.com/psychemedia/jupyterserverproxy-datasette-demo. Launch directly into a datasette instance on mybinder.org with https://mybinder.org/v2/gh/psychemedia/jupyterserverproxy-datasette-demo/master?urlpath=datasette then try to follow links inside the UI.

@wragge
Contributor

wragge commented Mar 23, 2020

This would also be useful for running Datasette in Jupyter notebooks on Binder. While you can use Jupyter-server-proxy to access Datasette on Binder, the links are broken.

Why run Datasette on Binder? I'm developing a range of Jupyter notebooks that are aimed at getting humanities researchers to explore data from libraries, archives, and museums. Many of them are aimed at researchers with limited digital skills, so being able to run examples in Binder without them installing anything is fantastic.

For example, there are a series of notebooks that help researchers harvest digitised historical newspaper articles from Trove. The metadata from this harvest is saved as a CSV file that users can download. I've also provided some extra notebooks that use Pandas etc to demonstrate ways of analysing and visualising the harvested data.

But it would be really nice if, after completing a harvest, the user could spin up Datasette for some initial exploration of their harvested data without ever leaving their browser.

@terrycojones

I just updated #652 to remove a merge conflict. I think it's an easy way to add this functionality. I don't have time to do more though, sorry!

@simonw
Owner Author

simonw commented Mar 23, 2020

Thanks very much @terrycojones - I'll see if I can finish it up from here.

@terrycojones

@simonw You're welcome - I was just trying it out back in December as I thought it should work. Now there's a pandemic to work on though.... so no time at all for more at the moment. BTW, I have datasette running on several protein and full (virus) genome databases I build, and it's great - thank you! Hi and best regards to you & Nat :-)

@simonw simonw pinned this issue Mar 24, 2020
@simonw
Owner Author

simonw commented Mar 24, 2020

I don't think I'll go with the <base> solution purely because it doesn't work with JSON APIs - and there are quite a few places where Datasette APIs return URLs (for things like toggling facets - e.g. suggested_facets on https://latest.datasette.io/fixtures/facetable.json?_labels=on&_size=0 )

The good news is that if you look at the templates almost all of the URLs have been generated in Python code: https://github.com/simonw/datasette/blob/a498d0fe6590f9bdbc4faf9e0dd5faeb3b06002c/datasette/templates/table.html - so it shouldn't be too hard to fix in Python. Ideally I'd like to fix this with as few template changes as possible.

@simonw simonw changed the title url_prefix config setting base_url configuritaion setting Mar 24, 2020
@simonw simonw changed the title base_url configuritaion setting base_url configuration setting Mar 24, 2020
@simonw
Owner Author

simonw commented Mar 24, 2020

Here's the line I'm stuck on now:

url_csv = path_with_format(request, "csv", url_csv_args)

Tricky question: do I continue to rebuild URLs based on the incoming request (on the assumption that it has already been modified to include the prefix) or do I expect that I may still see un-prefixed incoming requests and need to change them?

If the incoming URL paths contain the prefix, at what point do I drop that so I can run the regular URL matching code?

@simonw
Owner Author

simonw commented Mar 24, 2020

I'm going to assume that whatever is proxying to Datasette leaves the full incoming URL path intact, so I'm going to need to teach the URL routing code to strip off the prefix before processing the incoming request.
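That stripping step can be sketched like this (a hypothetical helper for illustration, not Datasette's actual code; the scope shape follows the ASGI spec):

```python
def strip_base_url(scope, base_url):
    """Return a copy of an ASGI scope with the configured base_url prefix
    removed from the path, so regular route matching can run unchanged."""
    path = scope["path"]
    if base_url != "/" and path.startswith(base_url):
        # Drop the prefix, keeping a single leading slash
        path = "/" + path[len(base_url):].lstrip("/")
    return dict(scope, path=path)

scope = {"type": "http", "path": "/new-base/path/here/fixtures"}
print(strip_base_url(scope, "/new-base/path/here/")["path"])  # /fixtures
```

A request whose path does not start with the prefix passes through untouched, which matters if the proxy rewrites some paths but not others.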

@simonw
Owner Author

simonw commented Mar 24, 2020

That means I should teach AsgiRouter how to handle an optional prefix:

class AsgiRouter:
    def __init__(self, routes=None):
        routes = routes or []
        self.routes = [
            # Compile any strings to regular expressions
            ((re.compile(pattern) if isinstance(pattern, str) else pattern), view)
            for pattern, view in routes
        ]

    async def __call__(self, scope, receive, send):
        # Because we care about "foo/bar" vs. "foo%2Fbar" we decode raw_path ourselves
        path = scope["path"]
        raw_path = scope.get("raw_path")

@simonw
Owner Author

simonw commented Mar 24, 2020

Actually I'll teach DatasetteRouter since that subclasses AsgiRouter but has access to a datasette instance (which it can read configuration values from):

datasette/datasette/app.py

Lines 750 to 753 in 298a899

class DatasetteRouter(AsgiRouter):
    def __init__(self, datasette, routes):
        self.ds = datasette
        super().__init__(routes)
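That shape can be roughed out like this (a simplified stand-in, with `AsgiRouterSketch`/`DatasetteRouterSketch` as hypothetical names; the real classes live in datasette/app.py and dispatch to ASGI views rather than returning them):

```python
import re

class AsgiRouterSketch:
    """Simplified stand-in for AsgiRouter: compiles string patterns and
    matches an incoming path against them."""
    def __init__(self, routes=None):
        self.routes = [
            (re.compile(p) if isinstance(p, str) else p, view)
            for p, view in (routes or [])
        ]

    def match(self, path):
        for pattern, view in self.routes:
            m = pattern.match(path)
            if m:
                return view, m
        return None, None

class DatasetteRouterSketch(AsgiRouterSketch):
    """Subclass that strips a configured base_url before matching, as the
    real class can read that setting from its datasette instance."""
    def __init__(self, base_url, routes):
        self.base_url = base_url
        super().__init__(routes)

    def match(self, path):
        if self.base_url != "/" and path.startswith(self.base_url):
            path = "/" + path[len(self.base_url):].lstrip("/")
        return super().match(path)

router = DatasetteRouterSketch("/new-base/", [(r"/(?P<db>[^/]+)$", "database_view")])
view, m = router.match("/new-base/fixtures")
print(view, m.group("db"))  # database_view fixtures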

@simonw
Owner Author

simonw commented Mar 24, 2020

OK, I have an implementation of this over in the base-url branch (see pull request #708) which is passing all of the unit tests.

Anyone willing to give it a quick test and see if it works for your particular use-case? You can install it with:

pip install https://github.com/simonw/datasette/archive/base-url.zip

Then you can run Datasette like this:

datasette fixtures.db --config base_url:/new-base/path/here/

@simonw simonw added this to the Datasette 0.39 milestone Mar 24, 2020
@terrycojones

Hi Simon - I'm just (trying, at least) to follow along in the above. I can't try it out now, but I will if no one else gets to it. Sorry I didn't write any tests in the original bit of code I pushed - I was just trying to see if it could work & whether you'd want to maybe head in that direction. Anyway, thank you, I will certainly use this. Comment back here if no one tried it out & I'll make time.

@simonw
Owner Author

simonw commented Mar 25, 2020

I got this working as a proxied instance inside Binder, building on @psychemedia's work: simonw/jupyterserverproxy-datasette-demo#1

Now that I've seen it working there I'm going to land the pull request.

@simonw simonw closed this as completed in 7656fd6 Mar 25, 2020
@simonw simonw unpinned this issue Mar 25, 2020
@simonw
Owner Author

simonw commented Mar 25, 2020

Shipped in 0.39: https://datasette.readthedocs.io/en/latest/changelog.html#v0-39

@terrycojones

Great - thanks again.

@wragge
Contributor

wragge commented Mar 26, 2020

Thanks! I'm trying to launch Datasette from within a notebook using the jupyter-server-proxy and the new base_url parameter. While the assets load ok, and the breadcrumb navigation works, the facet links don't seem to use the base_url. Or have I missed something?

My test repository is here: https://github.com/wragge/datasette-test

simonw added a commit that referenced this issue Mar 26, 2020
* base_url configuration setting
* base_url works for static assets as well
simonw added a commit that referenced this issue Apr 2, 2020
* base_url configuration setting
* base_url works for static assets as well
@LVerneyPEReN

Hi,

I came across this issue while looking for a way to spawn Datasette as a SQLite file viewer in JupyterLab. I found https://github.com/simonw/jupyterserverproxy-datasette-demo, which seems to be the most up-to-date proof of concept, but it seems to be failing to list the available databases (at least in the Binder demo, https://hub.gke.mybinder.org/user/simonw-jupyters--datasette-demo-uw4dmlnn/datasette/, I only have :memory).

Has anyone tried to improve on this proof of concept to get a Datasette visualization for SQLite files?

Thanks!

@wragge
Contributor

wragge commented Jun 10, 2020

There's a working demo here: https://github.com/wragge/datasette-test

And if you want something that's more than just proof-of-concept, here's a notebook which does some harvesting from web archives and then displays the results using Datasette: https://nbviewer.jupyter.org/github/GLAM-Workbench/web-archives/blob/master/explore_presentations.ipynb

@LVerneyPEReN

Hi @wragge,

This looks great, thanks for sharing! I refactored it into a self-contained function, binding to a random available TCP port (multi-user context). I am using the subprocess API directly since the %run magic was leaving defunct processes behind :/


import socket

from signal import SIGINT
from subprocess import Popen, PIPE

from IPython.display import display, HTML
from notebook.notebookapp import list_running_servers


def get_free_tcp_port():
    """
    Get a free TCP port.
    """
    tcp = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    tcp.bind(('', 0))
    _, port = tcp.getsockname()
    tcp.close()
    return port


def datasette(database):
    """
    Run datasette on an SQLite database.
    """
    # Get current running servers
    servers = list_running_servers()

    # Get the current base url
    base_url = next(servers)['base_url']

    # Get a free port
    port = get_free_tcp_port()

    # Create a base url for Datasette using the proxy path
    proxy_url = f'{base_url}proxy/absolute/{port}/'

    # Display a link to Datasette
    display(HTML(f'<p><a href="{proxy_url}">View Datasette</a> (Click on the stop button to close the Datasette server)</p>'))

    # Launch Datasette
    with Popen(
        [
            'python', '-m', 'datasette', '--',
            database,
            '--port', str(port),
            '--config', f'base_url:{proxy_url}'
        ],
        stdout=PIPE,
        stderr=PIPE,
        bufsize=1,
        universal_newlines=True
    ) as p:
        print(p.stdout.readline(), end='')
        while True:
            try:
                line = p.stderr.readline()
                if not line:
                    break
                print(line, end='')
                exit_code = p.poll()
            except KeyboardInterrupt:
                p.send_signal(SIGINT)

Ideally, I'd like some extra magic to notify users when they close the notebook tab and prompt them to terminate the running Datasette processes. I'll be looking for it.
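In the meantime, one stdlib-only approach (an assumption on my part, untested in JupyterLab, and it won't catch a crashed kernel) is to register an atexit hook so the subprocess is terminated when the interpreter shuts down; `launch_server` is a hypothetical helper name:

```python
import atexit
import subprocess

def launch_server(cmd):
    """Spawn a long-running server process and register a cleanup hook
    so it is terminated when the interpreter exits."""
    proc = subprocess.Popen(cmd)
    atexit.register(proc.terminate)
    return proc

# For the Datasette case above this might look like:
# launch_server(['python', '-m', 'datasette', '--', database,
#                '--port', str(port), '--config', f'base_url:{proxy_url}'])
```

Calling `proc.terminate()` sends SIGTERM, which Datasette's server should handle as a clean shutdown; registering the bound method means each launched server gets its own hook.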
