
Enhancing Hardware Command Responses with Detailed Asynchronous Status Updates #251

Open
tiyash-basu-frequenz opened this issue Aug 8, 2024 · 12 comments
Labels
part:protobuf Affects the protocol buffer definition files · type:enhancement New feature or enhancement visible to users


tiyash-basu-frequenz commented Aug 8, 2024

What's needed?

We need a more informative way to get responses from hardware components after issuing commands to them.

Currently, the response is merely an acknowledgement and does not reflect the actual state or events at the component. Ideally, responses should include the statuses of all activities our stack performs on the component.

For example, the backend might increase power to 100kW in a 3-step ramp of 10kW, 50kW, and 100kW. Currently, the response would only acknowledge the 100kW command. If the 50kW command fails when sent to the inverter (e.g., due to sudden bounds change or communication errors), the client remains unaware and has to rely solely on the power value in the component data stream. This is suboptimal.

This may involve exposing more low-level information to the clients, so we need to make this optional. Clients who do not want this information should not be required to receive it.

Proposed solution

There are two ways of addressing this:

Proposal 1

Update the set-power command RPCs to return a stream of responses.

Proposal 2

Have a separate RPC that streams command response statuses. In this case, each command should be assigned a unique ID by the service, and each streamed response status should contain this ID. (A client should not receive responses for commands issued by another client.)

Both proposals are technically feasible from the service's perspective. However, we need feedback from current users to determine their preferences and reasons.

Use cases

With the proposed changes, the client would receive a stream of real-time status updates for each command.

For example, after issuing the command to increase power, the client would receive updates indicating the progress and success or failure of each step in the process. If the command to write to the Modbus register succeeds but the component fails to respond within the expected timeframe, the client would be immediately informed of the issue, allowing for timely corrective actions such as retrying the command, diagnosing the communication problem, or adjusting their operational strategy based on the current status of the component.

Alternatives and workarounds

No response

Additional context

No response

@tiyash-basu-frequenz tiyash-basu-frequenz added part:❓ We need to figure out which part is affected priority:❓ We need to figure out how soon this should be addressed type:enhancement New feature or enhancement visible to users labels Aug 8, 2024
@tiyash-basu-frequenz tiyash-basu-frequenz modified the milestones: v1.0.0, v0.18.0 Aug 8, 2024
@llucax llucax added part:protobuf Affects the protocol buffer definition files and removed part:❓ We need to figure out which part is affected priority:❓ We need to figure out how soon this should be addressed labels Aug 8, 2024

llucax commented Aug 15, 2024

FYI @shsms


llucax commented Aug 15, 2024

So we seem to be leaning more towards option 2. Some advantages over option 1:

  • It should be less resource-intensive.
  • Not that we need it right now, but it is more backwards-compatible: we can keep the old RPC intact, and users that need more detailed replies can subscribe to the new stream.
  • If it turns out to be more convenient for the client to have a reply stream per request, we can still do that at the client level, by creating a Python channel for every request instead of an RPC stream for every request.
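The last point can be sketched as a small demultiplexer at the client level; `asyncio.Queue` stands in for the SDK's channels here, and all names are hypothetical:

```python
import asyncio
from typing import AsyncIterator


async def demultiplex(
    statuses: AsyncIterator[tuple[int, str]],  # (request_id, status) pairs
    channels: dict[int, "asyncio.Queue[str]"],
) -> None:
    """Route each status from the shared RPC stream to a per-request channel."""
    async for request_id, status in statuses:
        queue = channels.get(request_id)
        if queue is not None:  # drop statuses nobody is waiting for anymore
            await queue.put(status)
```

A background task would run `demultiplex` over the real RPC stream, and each caller would then read only its own queue.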


shsms commented Aug 16, 2024

leaning more towards option 2.

That sounds good to me.

If it is more convenient for the client to provide a reply stream per every request, we can still do it at the client level and create a python channel for every request instead of a rpc stream for every request.

This is not required for the SDK use cases. A single stream is easier.

tiyash-basu-frequenz (Author) commented

Thanks @shsms
We will go for option 2 then.

tiyash-basu-frequenz (Author) commented

In the past weeks, I was working on a similar problem in our proprietary stack, and I discovered that having a global response stream with request IDs leads to a few issues.

The first issue concerns complications on the server side: it needs to maintain a map of clients to output streams, and response objects for one client must not go into the output stream of another. This is additional overhead that could be avoided if each request simply returned its own response stream.

The second issue is request-ID tracking on the client side: managing request IDs for multiple requests is not trivial. If the client cares about the responses, it needs code to track request IDs across timeouts and out-of-order arrivals, which gets complicated quickly. With a response stream per request, clients can either consume each stream directly, or merge them into a global response stream if they need to.
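A minimal sketch of that client-side bookkeeping, assuming a shared status stream tagged with request IDs (all names here are hypothetical):

```python
import asyncio


class RequestTracker:
    """Sketch of the bookkeeping a global response stream requires: pending
    request IDs are tracked so that late or stale statuses (e.g. arriving
    after the caller already timed out) can be recognized and dropped."""

    def __init__(self) -> None:
        self._pending: dict[int, asyncio.Future] = {}

    def register(self, request_id: int) -> None:
        # Called when a command is issued and the service returns its ID.
        self._pending[request_id] = asyncio.get_running_loop().create_future()

    def on_status(self, request_id: int, status: str) -> None:
        # Called by the task reading the shared status stream.
        fut = self._pending.get(request_id)
        if fut is None or fut.done():
            return  # stale: the caller already timed out or got its answer
        fut.set_result(status)

    async def wait(self, request_id: int, timeout: float):
        # Returns the status, or None if the caller gives up waiting.
        fut = self._pending[request_id]
        try:
            return await asyncio.wait_for(asyncio.shield(fut), timeout)
        except asyncio.TimeoutError:
            return None
        finally:
            self._pending.pop(request_id, None)
```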

So, I am now inclined towards option 1.


shsms commented Sep 6, 2024

I can't think of why the SDK would need to track the request ID. I think we just need the component ID for which a request failed, and the SDK can take action with just that. So from the SDK side, I think option 2 is easier to use.

With option 1 we'd probably need more code, but I don't think there would be any performance cost. So AFAICS, going with option 1 shouldn't be too bad either.

tiyash-basu-frequenz (Author) commented

I think we just need the component ID for which a request failed, and the SDK can take action with just that.

The SDK would need more, I imagine. Say discharge commands were sent for 5 inverters. It would need to track the request IDs for each of these 5 discharge commands, otherwise it cannot know for which inverter the command failed.


shsms commented Sep 6, 2024

Say discharge commands were sent for 5 inverters. It would need to track the request IDs for each of these 5 discharge commands, otherwise it cannot know for which inverter the command failed.

Not if the status update contains the component ID of the inverter for which something failed. That should always be there, right?


shsms commented Sep 6, 2024

But never mind, option 1 does sound cleaner, modular, etc. So we can do that.

tiyash-basu-frequenz (Author) commented

Not if the status update contains the component ID of the inverter for which something failed. That should always be there, right?

It will, but the caller won't know which command it was for until the request ID is used. E.g., the response the caller just got could be a response to an older command, which it had stopped caring about after a timeout. So the request ID cannot be ignored in general. We have seen similar usage patterns elsewhere as well. This is also where things start getting more rabbit-hole-ish and complicated with option 2.

But never mind, option 1 does sound cleaner, modular, etc. So we can do that.

Will submit a PR soon!


shsms commented Sep 6, 2024

If an inverter had an issue, it shouldn't matter which request it was for; the action should be the same: block the inverter temporarily. This would be the implementation with option 1 as well, because otherwise option 1 would lead to the same problem on the client side of tracking whether to take action on the responses for a certain request ID.


llucax commented Sep 6, 2024

Well, not if you first receive a successful response for a later command and then a failure for an old one. I guess in that case we can consider the component OK, but that would be a very weird edge case, I guess.
