Add HTTP webserver interface #6

15532th · 2024-08-29T17:51:34Z

This PR implements enhancements discussed in #1, namely serving token as a JSON endpoint. It is meant to be a starting point for discussing the HTTP interface, specifically endpoints names and responses formats, so all the new code is placed into a separate file to avoid affecting any existing functionality.

Token generation runs independently from the webserver process, launching an update every 12 hours by default. Webserver uses port 8080 (configurable), with the latest token value being available on the /token endpoint:

$ curl 127.0.0.1:8080/token
{"updated": 1724445426, "potoken": "...", "visitor_data": "..."}

Additionally, it is possible to initiate immediate unscheduled token generation by accessing /update endpoint. While handy for testing, this feature seems to add more complications than it's worth, so it probably should be removed, unless it finds some practical use.

unixfox · 2024-09-16T21:06:42Z

thank you for this @15532th. I think it would be greatly appreciated if you could integrate it inside the main script. this way we can keep the logic common between the server web API and the one shot script.

15532th · 2024-09-17T13:26:07Z

Updated the main script. Output on successful token retrieval should be the same, however it might print different messages on errors.

I also fixed a few issues I came across while updating it, see individual commit messages for comments on the changes.

unixfox · 2024-09-17T14:13:16Z

@WardPearce Could you help me review? Thank you :)

WardPearce · 2024-09-17T14:15:27Z

@WardPearce Could you help me review? Thank you :)

more then happy to, I'll have a look over tmrw

WardPearce · 2024-09-17T14:45:30Z

@unixfox @15532th
Is their any reason why we couldn't use a micro web framework like litestar, starlette, fastapi etc? Would be more readable & provide additional functionality like automatic API documentation etc etc. (wsgiref itself doesn't appear to be considered production ready too & is a wrapper of https://docs.python.org/3/library/http.server.html)

Wouldn't updating Po tokens be better done using a in memory cache, what will expire after X seconds. Meaning that after every hour or so the request is slightly slower or possibly using aiocron (https://github.com/gawel/aiocron) if we must schedule in the background.

15532th · 2024-09-17T16:29:29Z

@WardPearce
I do have a couple of reasons I don't want to do this. Firstly, for an endpoint that is going to be accessed by a single client sitting on the same machine, making at worst one request per hour, it feels like an overkill to use production-ready frameworks capable of handling hundreds of RPS to achieve the same result as a few dozens of lines of pure Python code. Same goes for using in-memory cache to hold a couple of short strings (for all the added resource overhead and complexity it would only make things slower, I believe) and dealing with a cron-like interface to achieve something that can be done by a single coroutine with a while loop.

The second reason is that the typical environment I expect myself to end up using this token-retrieving functionality in, which is the end user machine, often running Windows. I don't want to ask my users to deal with complicated dependencies installation or setting up and running a docker image. In the current state of the token generator it only has a single pure Python dependency, which makes it easy to publish on pypi or package as an executable without running into obscure cross-platform compatibility issues (I am leaving out the biggest dependency, the Chrome itself, because a user machine is quite likely going to have a fresh version installed).

Regarding the "run update inside the Nth requests" vs "do updates in background" choice, my reasons for deciding on the latter were the following: while the usual delay on a fast machine would be only a few seconds, the worst case is currently a 30 seconds timeout, which some clients might not be prepared to handle, since it is not a delay you would normally expect from a web-service running on localhost. And if the issue that caused the timeout was only temporary, it means client has received an unnecessarily error while it could have been operating fine if it was served the old cached version of the token until a new update succeeds.

potoken.py

unixfox · 2024-09-17T16:42:50Z

The point in having an API is to fetch the po_token from an external tool that is not designed to do it by itself.

A po_token currently works for an indefinite amount of time, it's not a token that will expire after 1 hour. So serving a po_token generated from one hour ago is perfectly fine.

I do share the same vision as @15532th, this API won't be used at high scale and won't be accessible from the whole internet. Worst case scenario, someone runs 10 Invidious processes, it's just 10 requests done per hour, it's meaningless.

potoken.py

Turns out TemporaryDirectory initiates cleanup on interpreter exit even when it is used without context manager, which might interfere with nodriver trying to delete the directory at the same time.

It might happen if loaded page does not contain a video player or fails to load at all. Explicitly specify 10 seconds timeout for the selector, wrap it in a try-catch block, add separate error messages for the selector and handler timeouts.

It allows client code to control logging format and level when the module is used as a library and not a standalone script.

If connection to the launched browser gets interrupted or the browser itself exits abruptly before the video tab got loaded, the update procedure might hang up indefinitely.

Remove unnecessarily f-strings and double quotes, fix a few other issues.

It leaves less time for client to hit 503 error on startup, but more importantly, it deals with race condition in startup sequence: if extractor_task gets interrupted after webserver startup was initiated but before make_server() returned, then potoken_server.stop() becomes a no-op, leaving background task with webserver to run indefinitely.

…m binary

15532th force-pushed the webserver branch from 943a5bf to 260db3e Compare September 17, 2024 13:20

unixfox requested changes Sep 17, 2024

View reviewed changes