Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add HTTP webserver interface #6

Merged
merged 16 commits into from
Nov 17, 2024
Merged

Add HTTP webserver interface #6

merged 16 commits into from
Nov 17, 2024

Conversation

15532th
Copy link
Contributor

@15532th 15532th commented Aug 29, 2024

This PR implements enhancements discussed in #1, namely serving token as a JSON endpoint. It is meant to be a starting point for discussing the HTTP interface, specifically endpoints names and responses formats, so all the new code is placed into a separate file to avoid affecting any existing functionality.

Token generation runs independently from the webserver process, launching an update every 12 hours by default. Webserver uses port 8080 (configurable), with the latest token value being available on the /token endpoint:

$ curl 127.0.0.1:8080/token
{"updated": 1724445426, "potoken": "...", "visitor_data": "..."}

Additionally, it is possible to initiate immediate unscheduled token generation by accessing /update endpoint. While handy for testing, this feature seems to add more complications than it's worth, so it probably should be removed, unless it finds some practical use.

@unixfox
Copy link
Member

unixfox commented Sep 16, 2024

thank you for this @15532th. I think it would be greatly appreciated if you could integrate it inside the main script. this way we can keep the logic common between the server web API and the one shot script.

@15532th
Copy link
Contributor Author

15532th commented Sep 17, 2024

Updated the main script. Output on successful token retrieval should be the same, however it might print different messages on errors.

I also fixed a few issues I came across while updating it, see individual commit messages for comments on the changes.

@unixfox
Copy link
Member

unixfox commented Sep 17, 2024

@WardPearce Could you help me review? Thank you :)

@WardPearce
Copy link

@WardPearce Could you help me review? Thank you :)

more then happy to, I'll have a look over tmrw

@WardPearce
Copy link

@unixfox @15532th
Is their any reason why we couldn't use a micro web framework like litestar, starlette, fastapi etc? Would be more readable & provide additional functionality like automatic API documentation etc etc. (wsgiref itself doesn't appear to be considered production ready too & is a wrapper of https://docs.python.org/3/library/http.server.html)

Wouldn't updating Po tokens be better done using a in memory cache, what will expire after X seconds. Meaning that after every hour or so the request is slightly slower or possibly using aiocron (https://github.com/gawel/aiocron) if we must schedule in the background.

@15532th
Copy link
Contributor Author

15532th commented Sep 17, 2024

@WardPearce
I do have a couple of reasons I don't want to do this. Firstly, for an endpoint that is going to be accessed by a single client sitting on the same machine, making at worst one request per hour, it feels like an overkill to use production-ready frameworks capable of handling hundreds of RPS to achieve the same result as a few dozens of lines of pure Python code. Same goes for using in-memory cache to hold a couple of short strings (for all the added resource overhead and complexity it would only make things slower, I believe) and dealing with a cron-like interface to achieve something that can be done by a single coroutine with a while loop.

The second reason is that the typical environment I expect myself to end up using this token-retrieving functionality in, which is the end user machine, often running Windows. I don't want to ask my users to deal with complicated dependencies installation or setting up and running a docker image. In the current state of the token generator it only has a single pure Python dependency, which makes it easy to publish on pypi or package as an executable without running into obscure cross-platform compatibility issues (I am leaving out the biggest dependency, the Chrome itself, because a user machine is quite likely going to have a fresh version installed).

Regarding the "run update inside the Nth requests" vs "do updates in background" choice, my reasons for deciding on the latter were the following: while the usual delay on a fast machine would be only a few seconds, the worst case is currently a 30 seconds timeout, which some clients might not be prepared to handle, since it is not a delay you would normally expect from a web-service running on localhost. And if the issue that caused the timeout was only temporary, it means client has received an unnecessarily error while it could have been operating fine if it was served the old cached version of the token until a new update succeeds.

potoken.py Outdated Show resolved Hide resolved
@unixfox
Copy link
Member

unixfox commented Sep 17, 2024

The point in having an API is to fetch the po_token from an external tool that is not designed to do it by itself.

A po_token currently works for an indefinite amount of time, it's not a token that will expire after 1 hour. So serving a po_token generated from one hour ago is perfectly fine.

I do share the same vision as @15532th, this API won't be used at high scale and won't be accessible from the whole internet. Worst case scenario, someone runs 10 Invidious processes, it's just 10 requests done per hour, it's meaningless.

potoken.py Outdated Show resolved Hide resolved
potoken.py Outdated Show resolved Hide resolved
potoken.py Outdated Show resolved Hide resolved
potoken.py Outdated Show resolved Hide resolved
potoken.py Outdated Show resolved Hide resolved
potoken.py Outdated Show resolved Hide resolved
@15532th 15532th requested a review from unixfox September 18, 2024 12:25
potoken.py Outdated Show resolved Hide resolved
Turns out TemporaryDirectory initiates cleanup on interpreter exit even when it is used without context manager, which might interfere with nodriver trying to delete the directory at the same time.
It might happen if loaded page does not contain a video player or fails to load at all.

Explicitly specify 10 seconds timeout for the selector, wrap it in a try-catch block, add separate error messages for the selector and handler timeouts.
It allows client code to control logging format and level when the module is used as a library and not a standalone script.
If connection to the launched browser gets interrupted or the browser itself exits abruptly before the video tab got loaded, the update procedure might hang up indefinitely.
Remove unnecessarily f-strings and double quotes, fix a few other
issues.
It leaves less time for client to hit 503 error on startup, but more importantly, it deals with race condition in startup sequence: if extractor_task gets interrupted after webserver startup was initiated but before make_server() returned, then potoken_server.stop() becomes a no-op, leaving background task with webserver to run indefinitely.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants