share data between uwsgi workers #1813
Comments
I doubt that we really need shared state for the main use of searx: beside stats and "future features", each searx request is stateless and completely isolated. We should not give up this simplicity of searx. Requirements such as these disproportionately inflate searx. Sorry that I'm nitpicking about suggestions like this, but I would first like to see simple solutions to our most pressing problems. It is a bit OT and not fair from my side, but I would remind about #1785 and https://github.com/dalf/searx-stats2/issues/7 ;)
The purpose here is not to fall into the second-system effect or add complexity / too many dependencies, but, if possible, to find the right tool to share data between the workers. I've included Redis because it's the common solution, but it doesn't fit the searx spirit. Let's review the items:
The second purpose of this issue is at least to list what would require a global state, and maybe we can decide to do something about it at some point. For each of the items in the above list, it would be easier to have a global state: actually, one of the reasons this issue exists is the Python GIL. To make it clearer: if searx had been written in golang, this issue wouldn't exist. Async code and asyncio can help with that, see #1724 (even if it is recommended to use multiple workers, so...).

Out of topic: what is searx from the network point of view? IMHO, searx is weird from this aspect: there are multiple HTTP(S) pools (one per process), so there is no global outgoing request rate throttling. Is it far-fetched to say that this project https://github.com/unixfox/proxy-ripv6 is related to this issue?

Another example: the a.searx.space server is too slow (Kimsufi.fr KS-1). So if I search for something, I will get a timeout from the bing/whatever engine. On the second request, since the connection is already established, it will work. But since there are multiple connection pools, I have to click "search" two or three times (until I hit an "initialized" HTTP connection pool). Here, the problem is the server (the CPU performance). A browser has multiple windows and tabs, but one connection pool when using HTTP/2.
My practical experience with building networked services is limited. My experiences are rather atomic, and I try to make an assessment with that. What I ask myself is: what are we doing when we share states .. here are my 2 cents ..
Is it so? .. As far as I can say: it depends on the infrastructure where searx (workers) are placed, e.g. for load balancing, and that brings us to the subject at hand. If we really need global state (you named reasons with practical value), something like a global state service might be the answer. Such a global state service would have to be developed with respect for privacy, which is not an easy task. There will always be requirements that are not feasible in terms of privacy. For example, is it okay to store and share a session context globally for a common group of outgoing requests? .. I won't go into details, but I guess the answer is: "it depends". If we do so and searx workers are sharing outgoing sessions, the next requirement comes up: outgoing IP management is needed. If I sum up my thoughts, I have to say that we are talking about searx networks, or better, a searx cluster. Is my conclusion correct? .. thoughts?
How I "feel" the implementation:
It allows a review of what is shared. But if searx switches to asyncio code, the global state would be some Python variables, if and only if one worker is used. So we need to do some measurements / benchmarks of #1724 (asyncio code) or its updated version ( #1703 (comment) ). If we need more than one worker, then this issue stays open even with asyncio. Also, a note from PR #1724:
About the HTTP connections (not sessions): it is already the intent, see #192, even if uwsgi breaks this.
Except for the issue of embedding searx-checker (?), yes!
Yes.
Why
Currently, data is not shared between the uwsgi workers.
This includes:
How ?
I don't have an answer to that question. Most probably this issue is already solved somewhere in a way I haven't thought of.
uwsgi: SharedArea
#415 suggests using the SharedArea.
It's an array of bytes, so an abstraction must be built on top of it.
A very simple API can be built on top of it:
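As a hedged illustration (not taken from the original issue), get/set helpers could wrap the uwsgi SharedArea calls with pickle and a length prefix; the area id, size and locking scheme below are assumptions:

```python
# Sketch only: assumes "sharedarea = 4" (4 pages) in the uwsgi configuration,
# area id 0, and that the pickled data fits in the area.
import pickle
import struct

import uwsgi  # only importable when running under uwsgi

AREA = 0  # id of the shared area declared in the uwsgi config


def set_shared(obj):
    data = pickle.dumps(obj)
    uwsgi.sharedarea_wlock(AREA)
    try:
        # 4-byte little-endian length prefix, then the pickled payload
        uwsgi.sharedarea_write(AREA, 0, struct.pack('<I', len(data)) + data)
    finally:
        uwsgi.sharedarea_unlock(AREA)


def get_shared():
    uwsgi.sharedarea_rlock(AREA)
    try:
        size = struct.unpack('<I', uwsgi.sharedarea_read(AREA, 0, 4))[0]
        if size == 0:
            return None
        return pickle.loads(uwsgi.sharedarea_read(AREA, 4, size))
    finally:
        uwsgi.sharedarea_unlock(AREA)
```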
If the structure is the same for all workers, it will work as expected. A checksum of the allocation can be added at the beginning of the structure, so all workers can make sure they don't corrupt the data. Compared to mmappickle (see below), it would be way faster.
uwsgi: Caching Framework
I've tried the Caching Framework: it works as intended and allows sharing data between the workers.
Without an abstraction layer, it creates a hard dependency on uwsgi.
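A hedged sketch of how it could look, assuming a cache declared in the uwsgi configuration (e.g. cache2 = name=searxcache,items=100; the cache name and keys are assumptions):

```python
# Sketch only: store pickled objects in a named uwsgi cache shared by the workers.
import pickle

import uwsgi

CACHE = 'searxcache'  # must match the cache2 name in the uwsgi config


def shared_set(key, obj, expires=0):
    # cache_update overwrites the key if it already exists
    uwsgi.cache_update(key, pickle.dumps(obj), expires, CACHE)


def shared_get(key):
    data = uwsgi.cache_get(key, CACHE)
    return pickle.loads(data) if data is not None else None
```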
Note, if we look to the future (?): I haven't seen a similar feature in ASGI servers.
uwsgi: Signal
I've tried uwsgi signals: it doesn't seem possible to send a signal to all the workers, despite what the documentation says.
Signals could be used to store the data somewhere, then ask all the other workers to read the update.
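For reference, a minimal sketch of what the documentation describes (the "workers" target is supposed to deliver the signal to every worker), even though in practice it did not seem to reach all of them; the signal number and handler are arbitrary:

```python
# Sketch only: register a handler in every worker, then raise the signal
# from whichever worker wrote the update.
import uwsgi


def reload_shared_state(signum):
    # each worker would re-read the shared data from wherever it is stored
    pass


uwsgi.register_signal(17, 'workers', reload_shared_state)

# later, after writing new data somewhere shared:
uwsgi.signal(17)
```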
multiprocessing.shared_memory
The Python module multiprocessing.shared_memory allows sharing data between processes.
Unfortunately, it only works from Python 3.8.
The GitHub project SleepProgger/py_shared_memory extracts this feature, but it doesn't compile on Python 3.6, only from version 3.7: _PyArg_ParseStackAndKeywords has been renamed, and that's not the only change needed to make it work. Anyway, it would be nice if we could avoid compiling C code.
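For completeness, a hedged sketch of the standard-library API (Python 3.8+ only); the block name is an assumption:

```python
# Sketch only: one process creates a named block, others attach to it by name.
from multiprocessing import shared_memory

# creating side
shm = shared_memory.SharedMemory(name='searx_state', create=True, size=4096)
shm.buf[:5] = b'hello'

# another worker, attaching to the existing block
other = shared_memory.SharedMemory(name='searx_state')
print(bytes(other.buf[:5]))  # b'hello'

other.close()
shm.close()
shm.unlink()  # only the owner should unlink
```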
mmap file
https://docs.python.org/3.0/library/mmap.html
Same comment as for the SharedArea.
The "simple" API (as described in the ShareArea section but applied to mmap) could be a way to solve this issue without an additional server .... or perhaps the performance wouldn't be good.
mmappickle
The GitHub project UniNE-CHYN/mmappickle provides a Python dict over an mmap.
Damn slow.
multiprocessing.Pipe
If searx creates the processes by itself, multiprocessing.Pipe can be used.
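A hedged sketch, only applicable if searx spawned its workers itself (not the case under uwsgi):

```python
# Sketch only: a parent process collecting updates from a worker over a Pipe.
from multiprocessing import Pipe, Process


def worker(conn):
    conn.send({'engine_errors': 1})  # report a local state change
    print(conn.recv())               # receive the merged state back


parent_conn, child_conn = Pipe()
p = Process(target=worker, args=(child_conn,))
p.start()
update = parent_conn.recv()
parent_conn.send(update)  # in reality: merge, then broadcast to every worker
p.join()
```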
lmdb
https://lmdb.readthedocs.io/en/release/
From #967
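A hedged sketch with the py-lmdb binding: an on-disk key/value store that several processes can read and write concurrently; the path and keys are assumptions:

```python
# Sketch only: one environment opened by every worker.
import lmdb

env = lmdb.open('/tmp/searx-lmdb', map_size=10 * 1024 * 1024)

# one worker writes...
with env.begin(write=True) as txn:
    txn.put(b'engine_errors:bing', b'3')

# ...another worker reads
with env.begin() as txn:
    print(txn.get(b'engine_errors:bing'))  # b'3'
```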
posix_ipc
https://github.com/osvenskan/posix_ipc
Provides locks and shared memory using the POSIX API (so some C code has to be compiled).
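A hedged sketch with posix_ipc: a named shared memory segment plus a named semaphore used as a lock; the names and size are assumptions:

```python
# Sketch only: map a POSIX shared memory segment and guard it with a semaphore.
import mmap

import posix_ipc

mem = posix_ipc.SharedMemory('/searx_state', posix_ipc.O_CREAT, size=4096)
lock = posix_ipc.Semaphore('/searx_lock', posix_ipc.O_CREAT, initial_value=1)
buf = mmap.mmap(mem.fd, mem.size)
mem.close_fd()  # the mapping keeps the memory accessible

with lock:  # acquire / release around the write
    buf[:5] = b'hello'
```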
external server: Redis
This could be optional: the default configuration wouldn't share anything, but once Redis is configured, data is read from and written to it.
I think that's the usual way to solve this issue.
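A hedged sketch with the redis-py client; host, port and key names are assumptions:

```python
# Sketch only: workers share counters through a Redis server, if one is configured.
import redis

r = redis.Redis(host='localhost', port=6379)

# any worker can increment a shared counter...
r.hincrby('searx:engine_errors', 'bing', 1)

# ...and any worker can read the aggregated values
print(r.hgetall('searx:engine_errors'))
```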