
Server stream progress updates for RPC calls that have long running async operations #264

Open
tegefaulkes opened this issue Aug 23, 2024 · 12 comments
Labels: development (Standard development)

@tegefaulkes (Contributor) commented Aug 23, 2024

Specification

In the case of vaults clone we have a unary call. The call stays active for the duration of the cloning process, which in the background is a compound operation of complex streams. For a very large vault we run into a problem where the client-level unary call times out before the call can complete.

To fix this we need to change the unary call to a server streaming call and stream over progress updates that reset the timeout timer, preventing the timeout. In the case of vaults clone this means sending the cloning progress periodically, either every 5% or every few seconds. When the clone completes we send over details such as the vault name and the new vault ID of the cloned vault. In most cases we can't really know the absolute progress, but we can output the amount of arbitrary progress made. This progress must be output on STDERR like any other feedback output.

This change to streaming progress updates needs to apply to any client RPC call that waits for a complex or long-running task. As far as I can tell this only applies to vaults cloning, pulling, and cross-signing claims, but we'll need to investigate this deeper.

Additional context

  • Related: Polykey-CLI#185 - The client call timing out is very likely leading to an orphaned agent-agent RPC call and leaking a resource. This is likely what's causing the agent to lock up when shutting down.
  • Related: Polykey-CLI#198 - This also later triggers the agent crashing when the QUIC connection times out.
  • Polykey-CLI#74
  • js-rpc#52
  • js-rpc#57

Tasks

  1. Identify all client RPC calls that wrap complex long-running operations
  2. Refactor these RPC calls into server streaming calls that send over progress updates. The details of the progress aren't as important as the fact that each update resets the call's timeout.
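As a sketch of task 2, a unary handler could be refactored into an async generator that yields progress messages and then the final result. This is a minimal illustration, not the actual Polykey handler API; the function, message shapes, and result fields here are all hypothetical:

```javascript
// Hypothetical server streaming handler: takes the clone work broken
// into steps, yields a Progress message after each step (each message
// resets the client's timeout timer), then yields the final Result.
async function* vaultsCloneProgress(cloneSteps) {
  let done = 0;
  for (const step of cloneSteps) {
    await step(); // one unit of the underlying clone work
    done += 1;
    yield { type: 'Progress', done, total: cloneSteps.length };
  }
  // Final message carries the details mentioned in the spec above
  yield { type: 'Result', vaultName: 'vault-name', vaultId: 'vault-id' };
}
```

The key property is that the client never waits longer than one step between messages, so the per-call timeout only has to cover the gap between updates rather than the whole operation.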
tegefaulkes added the development (Standard development) label on Aug 23, 2024
@CMCDragonkai (Member)

Basically we need to stream progress to STDERR to avoid an ErrorRPCTimeout, but those bugs #185 and #198 need to be fixed too.

@CMCDragonkai (Member)

@tegefaulkes mentioned that the git cloning process doesn't have an abort system. So if we time out, the cloning likely continues, and the connection is leaked.

@CMCDragonkai (Member) commented Aug 24, 2024

This problem was actually discussed earlier here #74 and also connected to MatrixAI/js-rpc#52 and MatrixAI/js-rpc#57.

@tegefaulkes you even said:

Any commands that attempt a connection such as nodes add or vaults clone could take a long time to complete, especially if we fail to connect. The RPC timeout is 15 seconds with the connection timeout being the same 15 seconds but could take multiple 15 second attempts before giving up. Worst case is about 20/3 attempts to fail a nodesFind. This will cause a RPC timeout before we can respond with the actual error.

However, the grace timer solution (MatrixAI/js-rpc#59) is not sufficient. It assumed that the agent side would be able to tell the client side a richer error if the client side timed out.

However we understand that some operations like vaults clone might just take a long time, in which case the regular 15 second timeout shouldn't apply. The grace timer can still work for true unary operations, but vaults cloning and pulling are not unary: they are like download operations that may take arbitrary time. That's why we can identify these operations as requiring a streaming call instead.

@tegefaulkes (Contributor, Author)

I also need to note that the RPC handlers need to properly handle ctx and abortion.
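One way to sketch this ctx handling: the handler checks the abort signal between units of work so it cannot outlive its call. The handler shape and `ctx` object here are illustrative assumptions, not the actual Polykey handler signature:

```javascript
// Hypothetical streaming handler that respects ctx abortion: before
// each unit of work it checks ctx.signal, and throws once aborted so
// the underlying operation stops instead of leaking.
async function* handlerWithAbort(ctx, workItems) {
  for (const work of workItems) {
    if (ctx.signal.aborted) {
      throw new Error(`Handler aborted: ${ctx.signal.reason}`);
    }
    yield await work();
  }
}
```

In practice the abort signal would also need to be propagated into the git streams themselves, since (as noted above) the git cloning process currently has no abort system of its own.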

@aryanjassal (Member) commented Dec 10, 2024

https://isomorphic-git.org/docs/en/onProgress

isomorphic-git supports a progress callback natively, so we can leverage that. We can show a spinner for operations without a known total, and a progress bar for operations that provide one.

$ polykey vaults pull top-secret
⠟ Pulling vault "top-secret"

$ polykey vaults clone ...
Cloning vault "important"
[====>------] 51%

A simple callback can send the relevant data, and the agent can differentiate between progress packets and response packets.

// On the client: bridge the onProgress callback into the stream
// (sketch; assumes some queue the callback can push into)
const onProgress = (event) => {
  progressQueue.push({
    type: 'Progress',
    progress: event.loaded,
    total: event.total,
  });
};

// On the agent: differentiate progress packets from the response
for await (const result of rpcMethod()) {
  if (result.type === 'Progress') handleProgress(result.progress);
  else handleResponse(result.response);
}

Member

No, it has to be in-band to the RPC call. You may want to make a richer protocol, but that makes it incompatible with external git protocols.

Member

An alternative is a concurrent RPC stream.

@aryanjassal (Member) commented Dec 11, 2024

Well, even if we do make a richer protocol for this feedback, won't it only differ in the data sent back? If that's the only difference, can't I just use the progress event to generate data and send it over the richer protocol?

For example, if our protocol requires an update every 500ms, I can have a promise in the background reading from a data structure, and update that data structure whenever git provides new data. Updates then go out every 500ms while still using the data provided by git.
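That decoupling could be sketched as a small relay: git's onProgress callback writes into a shared slot, and a background interval emits the latest value on the assumed 500ms cadence. The `emit` function and relay shape are hypothetical:

```javascript
// Relay between git's event-driven progress and a fixed update cadence.
// onProgress overwrites the shared slot; the interval sends the latest
// value (if any) every intervalMs and clears the slot.
function makeProgressRelay(emit, intervalMs = 500) {
  let latest = null;
  const timer = setInterval(() => {
    if (latest != null) {
      emit(latest); // send over the protocol at the fixed cadence
      latest = null;
    }
  }, intervalMs);
  return {
    // Pass this as isomorphic-git's onProgress callback
    onProgress: (event) => { latest = event; },
    stop: () => clearInterval(timer),
  };
}
```

A nice side effect is rate limiting: however often git fires progress events, at most one message per interval goes over the wire, and it's always the most recent one.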

@CMCDragonkai (Member)

From an architectural perspective, the leaking of streams/connections caused by this sort of IO is precisely the common factor in all cases where the agent fails to stop gracefully, or where a connection/stream leak results in a memory leak. This is absolutely a resource control IO problem, and it is not easy to solve because it is not a "bug" but a design problem: you must think very carefully about what you're doing and ensure that any solution scales to all abstraction levels and scenarios.

@aryanjassal (Member)

I had another idea to help avoid, or at least reduce, timeouts. If someone doesn't have fast internet and the timeout is occurring because the client can't upload data fast enough, then instead of timing out and aborting the stream, an optional callback could handle the timeout and attempt to retransmit the data with a smaller packet size.

This idea might become redundant if we implement something like concurrent streams or progress updates for streamed data, but this concept could still be useful for unary and server handlers where a potentially large amount of data is sent over from the caller to the handler in a single message.

I had this idea while testing something that involved sending around 1 MiB (1024 KiB) of data in a single RPC message. While my internet can sustain this, some places might not, and the RPC could time out. I was uploading a small PDF, but with a larger file the network could become the bottleneck.

I'm aware that in this case, the bottleneck was more than likely the server rather than the network, but this could be something we want to consider for the future.

@CMCDragonkai (Member)

Well, vaults aren't supposed to be big anyway. But our QUIC code needs a concurrency fix; its bandwidth is quite slow. So the most important low-hanging fruit should be targeted rather than trying to implement a complicated progress system.
