Increasing bandwidth of multi-device drivers #1499
@k-ross @jpaana @pawel-soja Can you please share your thoughts on this? |
Are you sure the pipe is the bottleneck? What happens once all the feeds are aggregated into a single stream to send to the client connected to the indiserver? Wouldn't that be even more of a bottleneck? What kind of throughput do we actually need? I did a quick test and measured the actual throughput of piping data through stdout, and I get approx. 2.6 Gbps, which is faster than most TCP/IP networks. This was on my Raspberry Pi, the slowest computer I own. :) The sender:
The receiver:
The result, on a Raspberry Pi 4:
If we need more throughput than this, my suggestion is to offload the incoming streams to separate threads in the indiserver, since I suspect any bottleneck is probably the processing of the incoming data, and not the actual transfer of the data. |
Thank you Kevin! Let's take an example:
Both of these drivers are connected to INDI servers via a bidirectional pipe (stdin/stdout). For the QHY driver, each camera has its own thread when sending data out to the INDI server. If the pipe is not the bottleneck, then the INDI server should be, since it is single-threaded. As far as the client goes, in KStars I actually open a separate client connection to the INDI server for each BLOB, so each BLOB acts as a separate client to the INDI server. I believe the next step is to introduce multi-threading so that each driver is handled in its own thread, and each client is handled in its own thread as well. IIRC, @pawel-soja suggested we migrate the INDI server to C++ as well, which would make this even easier, and perhaps even achieve more cross-platform compatibility. |
Have you done any instrumentation to determine where the bottleneck is? Have you timed how long it takes in indiserver to read the data, vs timing how long it takes to process the data? Maybe the slowness is parsing the XML, doing base64 decoding (if you're doing that), or just moving the data around and making unnecessary copies. All could slow things down, and in a single threaded server, any one area of slowness will affect everything. The good news with single threaded servers, they are usually a lot easier to profile, and find the bottlenecks. :) |
You're right, let me create a benchmark we can check against. |
Hi, I have been thinking about this for a long time: it would be natural to exchange blobs as file descriptors over a Unix domain socket. The FD exchange is a special kind of message on that domain. Each FD would reference an anonymous memory region (memfd_create). So local clients can receive raw data from a driver without the server or the kernel having to process it at all. Also, multiple client processes would efficiently share the memory space. AFAIK this method is used by kdbus for performance. https://indilib.org/forum/development/5465-idea-for-a-zero-copy-blob-transport-protocol-in-indi.html I would be willing to help here... |
That would be awesome @pludov especially if we can also implement multi-threading in INDI server along the way. |
I'm not opposed to this idea, but I have some questions. Why send the FD over a Unix domain socket? Why not include the FD in the XML response? Does each blob get a new FD? Or is one shared memory space kept open and reused? If it's reused, how do we communicate locking back and forth between client and server, so server knows when client is done and can overwrite with new blob? I guess one approach is to allocate a large chunk of shared memory, and treat it like a ring buffer. Would this be significantly faster than sending the blob in binary form, instead of base64, over a TCP socket? Sending the blob in binary form over a TCP socket would allow non-local clients to also enjoy a performance boost. |
An FD is a special piece of information that passes from process to process, like a pointer to a buffer on the kernel side. It gets special handling from the kernel when exchanged between processes (over the AF_UNIX communication domain); you cannot just embed a number in the existing protocol (the FD's actual value on the sender would make no sense to the receiver). Each blob will get its own FD; the emitter is responsible for allocating it. The inter-process passing and sharing are handled by the kernel. Basically, the kernel keeps the buffer alive as long as it is referenced by at least one process. To access the actual buffer data, processes need to mmap it. In the case of the indiserver, this mmap is not required, so the cost of a blob exchange within the indiserver would not be related to the size of the blob: all messages would take almost equal processing time on its side. FD exchange cannot be used over a TCP connection (the peers are presumably not on the same host, so there is no way to share memory). We need to keep the current base64 protocol to retain support for remote TCP. The Unix/shared buffer protocol would still benefit the indiserver and local clients (no encoding/decoding, faster message exchange, possibly no memory copy at all). It would require a "converter" at the TCP/remote endpoint to support remote connections. This converter can be a per-client process/thread; that way we would get a fast indiserver, pure Unix/shared buffer, and a per-connection thread for the actual base64 encoding (which should always be enough, since one CPU doing base64 will certainly always be able to saturate a TCP connection). On the implementation side, I propose to implement in the following order:
I would go with an event loop library (libev) for indiserver and its "tcp connector" |
@pludov do you think you can work on some sort of proof of concept sample for this? I think if we can reduce the copying between drivers and indiserver on the same host that would bring tons of benefit. For remote clients & remote drivers (that are snooping on BLOBs), we should always fallback to TCP/Base64. Now this could also be extended to LOCAL clients. Perhaps they can even be aware of this feature and can support, but base64 should also be there to maintain compatibility. |
@knro : I'll create a TCP/Unix message converter first, because it will be required for any later real testing. And the code will be the basis of the Unix socket indiserver. Given my spare time available, it will take some weeks... |
I started working on that, beginning with the C++ conversion of indiserver, here: https://github.com/pludov/indi/tree/fast-blob-transfer I'll add libev for the event loop and go on with implementing communication over Unix datagram sockets. Any idea why I need to add libs/ to the lilxml include? master is not building on my Ubuntu 18 box otherwise...
|
Excellent progress! Looking forward to seeing the first operational C++ INDI server. The lilxml issue is minor and probably we just need to add an include in CMakeLists.txt later on. |
I now have a locally working C++ indiserver based on libev. But the CI fails because libev is not installed on the build containers... I'd like to have this working to validate libev compatibility (esp. on macOS...). How can I fix this? |
Is it possible to edit the yaml file to apt-get install that? |
The trick was to use the correct Docker registry. Forks push to their owner's repo, but tests still use the container from the main repo. I fixed that. I am almost finished with the C++ conversion & libev integration. It is working for my simple test setup (no remote drivers, ...). Would you consider a merge request for this part, before I go on with testing the datagram implementation? |
A merge request is a bit early now. It is better to demonstrate the whole spectrum. I will checkout your branch and give you feedback once you let us know it's ready for first phase testing. |
Not ready for testing, but I now have a working driver => indiserver connection over a Unix socket with blobs exchanged as shared buffers :-) It is still inefficient, as base64 encoding is done in the indiserver event loop for all clients. I'll now move that away from the indiserver event loop and allow local clients to receive blobs as shared buffers. Optionally, drivers will then need (light) adaptation to write their blobs directly to a shared buffer (avoiding a copy here as well). |
So this needs changes in INDI::CCD? This shouldn't be hard to do. We should perhaps even include clients so they gain from this, if possible. Looking forward to testing this and comparing the results. Edit: I know INDI::CCD is just for cameras, but this is where the bulk of BLOB transfers is happening right now, along with INDI::StreamManager. |
I just pushed an implementation that can use shared buffer for blob delivery from driver to libindiclient (lots of FIXME but that path is OK). So I did some quick benchmark :-) I used the following "setup":
indi 1.9.1:
shared buffer version:
My interpretation:
Also, that test does not measure the latency on the driver side. But with such low CPU usage on indiserver, it will be able to keep up with a lot of driver traffic :-) I didn't look at StreamManager or INDI::CCD in detail for now. As you can see, there is already an advantage for drivers without further changes (overall CPU is divided by 2), but there is probably something in StreamManager that consumes a large amount of CPU for copies/conversions/... (provided I disabled all the star/noise rendering code... just producing horizontal bars here!) |
Awesome progress! The CPU utilization is definitely much better, but I expected we would see a significant FPS boost from this as well, since we avoided a lot of copying. What if you reduce the resolution? Can you see more of a difference in FPS? How about with real cameras? |
I'll debug/profile ccd_simulator to find out what takes that time (16-to-8-bit conversion?).
The signature of CCDChip should change so that drivers produce a new buffer for every frame and transfer its ownership. This would remove the need for synchronization. Currently the frame needs to be copied before the next acquisition can start.
Then for the simplest cases where no conversion is required, the buffer could be sent as-is. Depending on the hardware (esp. DMA), this would be CPU-free on the driver side as well.
|
The bottleneck in my scenario is GammaLut16::apply. It is however quite simple, I'll check the optimization level of the compiler... |
Since indi_simulator is not able to saturate one indiserver even before the shared buffer optimization, the apparent FPS gain was limited. So I tried to optimize the effectiveness of ccd_simulator for my benchmark. It appears that lots of frames are lost because they reach deliver while the previous image delivery is still ongoing (previewThreadPool.try_start). I changed it to start, which to my understanding creates a queue of at most 1 item. Makes some difference. indi 1.9.1, 47 MP images (7680x6144) with the previewThreadPool.start change:
shared buffer version, 47 MP images (7680x6144) with the previewThreadPool.start change:
In indi_simulator, the GammaLut16 function is always the slowest point. And then most of the remaining time is spent doing vector copies or initialization: lots of std::vector moves actually end up doing memcpy or memset. Also, my current shared buffer implementation has one more copy for shared buffer initialization, which could be optimized by preallocating buffers. |
For that resolution, 15fps is impressive. This is also a good chance to optimize any bottlenecks or unnecessary copying in the pipeline. What's the next step? |
I have the following points to address before others could seriously test:
At that point, testing will be welcome. Also a good time to start a code review. The next step would be to move on to optimization of the driver / driver API side. But I think the indiserver part should be merged before the driver / driver API work occurs, as they are mostly independent. |
Sure, we can merge INDI server. Please submit a PR and we'll test it thoroughly under multiple scenarios. I presumed you tested it with chained servers? Regarding MacOS, I can test on it, so no need to disable there if you need feedback. I have both Intel based and M1 Mac minis. Regarding transfer formats, in #1200 I proposed a way to decouple capture vs transfer formats. For streaming, we send either motion jpeg if natively supported by the source, or raw 8bit or 24bit bitmap or convert the source to jpeg and send that. |
I'll fix the four points above and test on my side (takes some time) first. By the way, I've found an existing bug as well about blob handling: #1528. It would be great to have some smoke tests in CI for that. |
@pludov Wanted to check if there has been any updates on this feature? anything to test now? |
Unfortunately, lots of clear nights here have kept me away from this...
|
@pludov Any good news about this? |
Heya! Sorry to bother you again regarding this, but is there any update? Anything to test? |
Hi ! No progress so far... but still high on my todo list !
|
Great, let me know if there is anything we can do to help this make it into 1.9.3. Can we break this into milestones that are easily trackable? |
Alright, so I moved this to 1.9.4 milestone. @pludov do you think we can get something ready by then? |
Hello @pludov I hope you're doing great. Just wanted to check in and see if there has been any progress here? What's the next step? |
Hi ! I am not so active on this atm unfortunately, but still planning to get back to it!
I started adding feedback for blob delivery, because without it, nothing prevents blobs from being queued within the FIFO, and as soon as a reader cannot read fast enough, the OOM killer gets invoked. Also, in that case, the memory for in-transit blobs is not accounted to any process, so the OOM killer's choice of victim is not clever...
I will ensure that a blob emitter never queues more than one blob (or 2, 3, configurable), and waits for an ack from the receiver (only for local/Unix sockets; this would make no sense on TCP).
|
I did move forward on that subject (you can see here: https://github.com/pludov/indi/ ). It's working on my test system with phd2 and mobindi. There are still some todos here and there, especially in the qt-client. To ensure feedback for blob delivery I had to add something to the protocol, because it's not possible at a lower level: without a protocol change/addition, the sender cannot know what the reader has received, and so may fill the socket buffers with blobs and trigger the OOM killer. This comes from the fact that the actual message size in bytes occupied within the kernel socket buffer does not reflect the memory used. With big blobs attached, it's easy to have GBs of RAM referenced from a small socket buffer (especially if the receiver is busy). And lowering the socket buffer to an extremely low value (like < 100 bytes) would badly impact performance. The idea I implemented is that a blob sender will not send a new blob while the previous one has not been acknowledged. This pingRequest/Reply is used when sending blobs. Before sending a blob (no. 2), the sender will ensure the previous one (no. 1) has been acknowledged:
I think this mechanism is useful only when using the Unix connection/shared blobs. For existing connections it makes no difference, as the blob size is largely bigger than the buffering between client and server, and no two blobs can be in flight at the same time... Also, I felt the need to add some automated integration tests for indiserver and baseclient. The idea here is to start one process (indiserver or indiclient) and instrument (mock) its connections to test various behaviours (at the XML/protocol level). You can see what it looks like in the /integs directory. For example, when testing indiserver, ping helps in the following test:
The alternative is to insert a pingRequest/reply to ensure the state of the server after the enableBlob. The test becomes:
So I propose we add this "ping" mechanism to a new version of the INDI protocol (with proper handling of the previous version), for shared blobs and for integration tests. What do you think about it? |
Hi @knro : what's your opinion on the proposal above? Any chance it would be accepted upstream? We can discuss live if you prefer... |
Thank you for the proposal, I need time to review it fully (just released INDI 1.9.4 yesterday and I should be able to look at it in depth this weekend). |
Sorry, been busier than usual. Will get back to this soon. I think it's going to be extremely challenging to change INDI wire protocol, but not impossible if it is kept backward compatible. |
@pludov Sorry it took me such a long time to respond to this, but now that INDI 1.9.6 is ready for release, I've been thinking of how we can incorporate this in 1.9.7. Let's take this scenario:
1. One driver
2. One server
3. One local client with shared memory support
4. One remote client with shared memory support
5. One local client without shared memory support
From my understanding, once the driver has the data, it can signal to the server that the data is ready to be read (perhaps by a property?). The server then checks its clients. Since it has 3 clients, it decides how to send the BLOB to each:
1. For the first local client with shared memory support: no need to send the blob, only signal to it that it is ready (perhaps by the same property mentioned above?)
2. For the second, remote client, it doesn't matter that it has shared memory support; we need to send it via TCP sockets. So the server must read the shared memory and then send it.
3. Same as 2: the server must read and send via TCP.
The driver cannot use the shared memory yet (no queue?) until the server and the 1st client complete reading it. How is the driver informed about this so that it can start utilizing the memory again? Regarding the change in protocol, you propose a new message type? How exactly is it used? Is it only between client and server, or also between server and driver? |
Hi Jasem,
It's been a while! Maybe we should have a live talk about it, but here is a written response attempt.
First, a slight precision: there is no "shared memory" in my proposal. The driver will generate a new buffer for each frame, and the buffer reference is then passed over to the server and then the client if they support it (or copied/base64 for TCP connections). What is exchanged is a buffer reference, and this travels over the existing INDI protocol when running over a Unix domain socket (using the file descriptor attachment feature of the Unix domain); this replaces the base64 encoding.
Using that scheme, access and synchronization are ensured by the kernel itself: accessing the memory of a buffer by mmap, releasing a buffer once nobody uses it anymore (reference counted), etc. There is no buffer reuse to care about (the kernel will reuse whatever memory), so the driver doesn't have to implement any synchronization.
Also, having data in a kernel FD object gives access to some highly optimized kernel functions: reading/writing content directly to a file or device without even mapping the memory into the processes (e.g. DMA in the driver, using the https://www.systutorials.com/docs/linux/man/2-sendfile/ or https://man7.org/linux/man-pages/man2/copy_file_range.2.html syscalls). In theory we could have on-disk capture implemented with the existing driver/server/client architecture, with almost no CPU use on systems/drivers that have efficient DMA.
_Benefit to existing clients:_ A driver can benefit from shared buffers if the server does so. This already avoids two copies of the data plus the TCP buffering latency (transfer between server/driver). Since the buffer exchange is blazing fast, we could also simplify most drivers by removing the need for a dedicated "write" thread.
_Why a protocol change:_ One drawback I found is that in this buffer exchange scheme, when a "blob reader" is slow (either a client or the server), buffers can accumulate on its side, because the socket buffer is large compared to the size of the messages exchanged. Lots of messages can fit in that buffer before being read by the client, and if they all have attached blob buffers, this can add up to a large amount of memory. This situation can lead to system memory exhaustion if too many or too big buffers accumulate. That's why there is a need for feedback after a blob has been sent over this channel, so the sender knows how many blobs / how much memory is still in transit to the reader, and can do some skipping if necessary. I added a ping/reply mechanism that is just a way to learn when a message has actually been read: the blob emitter sends a ping request after a blob, and will not send any more blobs until it receives the associated reply. The ping request/reply is local to the link (between driver/server or server/client) and is not propagated.
This ping request/reply mechanism also allows writing deterministic integration tests. I wrote tests that trigger specific cases like "a driver message arrived before a client registration" vs "a driver message arrived after a client registration". From a CI/testing perspective, without protocol synchronization it is not possible to know in advance the order in which the server processes two messages issued on two different channels. The best you can do is insert sleep instructions, which ultimately makes your tests unreliable (producing false positives when running on a loaded machine). Using this ping mechanism, you can make sure a message has been processed by the server at any point and test exactly one scenario. I used that to write a non-regression test suite for the rewritten indiserver.
_Being transparent with existing clients:_ The whole buffer exchange mechanism relies on using a Unix domain socket, which guarantees the peers of a connection are running on the same system. No client today can use this kind of connection, since the server is purely TCP up to now. So in my proposal the new protocol message is initiated exclusively on this kind of socket, changing nothing for existing clients. For would-be clients using the new connection mechanism, the ping is handled at a low level and would be transparent.
Ludovic
|
Thank you Ludovic for the detailed response. I think we should arrange a video meeting to go over this, perhaps even with the participation of other INDI developers, so we can reach a consensus moving forward. I believe this would bring significant improvements, especially if the driver/server/client are all on the same physical machine. Thanks for clarifying my misunderstandings about this, and I probably have more! How about we arrange a meeting next week? How is your availability? You can send me a direct email to discuss the setup. |
This is now fully implemented in INDI thanks to outstanding work by @pludov! |
Currently some multi-device drivers (e.g. QHY) can run multiple cameras in a single instance connected to INDI server. The driver and server are connected via pipes (stdin/stdout). When a BLOB is sent from the driver to the server, a base64 encoded BLOB is sent via an INDI XML message. The server parses the BLOB and redirects it to interested clients.
A primary limitation of this method is congestion of the pipes when multiple devices start sending BLOBs at the same time; this is especially evident when multiple large-sensor cameras begin streaming simultaneously. This is further exacerbated by the single-threaded INDI server.
There are a few possible solutions to this: