Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce heartbeats #190

Merged
merged 25 commits into from
Jul 29, 2024

Conversation

jaronoff97
Copy link
Contributor

@jaronoff97 jaronoff97 commented May 7, 2024

This PR introduces a client side heartbeat that can be set via the OpAMP Connection Settings. After discussing more in the past SIG meeting, I worked through the following pro/con for a client side vs server side heartbeat.

closes #183

Client Side

A client side heartbeat would be negotiated between the client and server, where the client's default heartbeat is set to 30s. On connect, the server has an option to entirely disable heartbeats by setting an explicit 0 for the field. The server can also offer a different heartbeat interval depending on its infrastructure's needs. After connection, the agent will begin a timer and each N seconds will send a message that minimally contains its instance id. An HTTP connection will use this interval for its polling as well. Some agents that are not directly informed of health changes should also use this for component health reporting.

Pros

  1. The agent is able to prevent a proxy from timing out the socket connection
  2. The agent's HTTP polling interval is now configurable
  3. The server can properly age out and remove dead agents
  4. Requires a single successful message for heartbeat processing

Cons

  1. requires a proto change
  2. extra work for the client
    a. the client will now need to keep track of a heartbeat timer to send this periodic message

Server Side

A server side heartbeat would simply be a part of opamp-go and would require no changes to the spec to allow this to work. The server every N seconds would send an empty message to the client to keep the socket connection active.

Pros

  1. Requires no spec or proto changes
  2. Server is in control of the interval

Cons

  1. Server has no way to determine if an Agent is dead
    a. The core value of the change to me is that the server can now rely on the fact it is receiving a message every N seconds and can take action if that is the case
    b. If the client is using an http transport, there is no way for the server to reliably send a heartbeat message to guarantee the liveness of the agent. Say the server 'requests' a heartbeat from the client, but the client is already dead
  2. The only way to send an 'empty' server to agent message today is by using the report full state flag. This means the message back from the agent is going to be larger than necessary solely to keep the connection.
    a. We could also add a heartbeat flag as part of the message
  3. Requires three successful messages for heartbeat processing
    a. A server would need to successfully send the heartbeat flagged message over the websocket, the client would then send its heartbeat back via an AgentToServer message, and the server would need to ACK with a responding ServerToAgent message.

I think given the pros and cons of the above, I prefer an Agent heartbeat over a server heartbeat. If we need a new proto change anyway to introduce a heartbeat flag, I think the client approach is more effective. Furthermore, this change helps provide guidance for agents that are not informed of status updates. By setting an explicit heartbeat for the client, the server can increase the granularity of the agent's status updates. The client heartbeat approach also matches the design for a conventional deadman's switch – something that is constantly sending a signal out for a receiver to detect only when that signal is no longer received. Flipping that design removes that guarantee and weakens the overall feature.

Finally, the server would also be able to explicitly disconnect misbehaving clients and force them to reconnect with new settings. If the server were to not receive a heartbeat within its set window, the server could initiate a disconnect to gracefully close the client. This approach would not work as well for a server heartbeat as it would need to cancel the initial server to agent message.

Open Questions

  • Where should we put more information about client-side heartbeating should work

References

  • RabbitMQ
    • Rabbit prefers client heartbeats over server ones, AND explicit heartbeat instead of TCP keepalives
  • MQTT
    • Functionally the same as this proposal, they call them keep alives
  • Phoenix Socket Client
    • Works by the client sending a specific heartbeat message

Copy link
Member

@mwear mwear left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly LGTM. The reasoning behind client side heartbeats makes sense and your references to other well known projects helps validate that decision.

proto/opamp.proto Outdated Show resolved Hide resolved
specification.md Outdated Show resolved Hide resolved
specification.md Outdated Show resolved Hide resolved
@jaronoff97 jaronoff97 marked this pull request as ready for review May 8, 2024 19:56
@jaronoff97 jaronoff97 requested a review from a team May 8, 2024 19:56
specification.md Outdated Show resolved Hide resolved
proto/opamp.proto Outdated Show resolved Hide resolved
specification.md Outdated Show resolved Hide resolved
proto/opamp.proto Outdated Show resolved Hide resolved
specification.md Outdated Show resolved Hide resolved
Copy link
Member

@tpaschalis tpaschalis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall, and I think the client-side approach would work best.

specification.md Outdated Show resolved Hide resolved
specification.md Outdated
the Agent is still alive and active. A server could use the heartbeat to make decisions about
the liveness of the connected Agent.

A default of a 30s should be used if not set by the OpAMPConnectionSettings.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As brought up in #183, should we write a note about clients respecting values negotiated from the server or coming from a backpressure mechanism (eg. a Retry-After header?)

@tigrannajaryan
Copy link
Member

Some agents that are not directly informed of health changes should also use this for component health reporting.

@jaronoff97 Can you clarify this? I am not sure I fully understand what this means.

@jaronoff97
Copy link
Contributor Author

@tigrannajaryan right now the bridge is not informed of the health of the agents it monitors, but instead uses the heartbeat interval to query the component health in the cluster. I figured that other agents that are doing similar interval-based health operations should do the same. I can clarify more in the spec to make that clear.

specification.md Outdated Show resolved Hide resolved
proto/opamp.proto Outdated Show resolved Hide resolved
proto/opamp.proto Outdated Show resolved Hide resolved
proto/opamp.proto Show resolved Hide resolved
Copy link
Member

@mwear mwear left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. One formatting suggestion to appease markdown lint.

specification.md Show resolved Hide resolved
Copy link

@evan-bradley evan-bradley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me on whole, just a few notes regarding wording.

proto/opamp.proto Outdated Show resolved Hide resolved
specification.md Outdated Show resolved Hide resolved
specification.md Outdated Show resolved Hide resolved
specification.md Show resolved Hide resolved
specification.md Outdated Show resolved Hide resolved
specification.md Outdated Show resolved Hide resolved
specification.md Outdated Show resolved Hide resolved
specification.md Outdated Show resolved Hide resolved
proto/opamp.proto Outdated Show resolved Hide resolved
proto/opamp.proto Outdated Show resolved Hide resolved
specification.md Outdated Show resolved Hide resolved
specification.md Outdated Show resolved Hide resolved
proto/opamp.proto Outdated Show resolved Hide resolved
specification.md Outdated Show resolved Hide resolved
specification.md Outdated Show resolved Hide resolved
proto/opamp.proto Outdated Show resolved Hide resolved
proto/opamp.proto Outdated Show resolved Hide resolved
proto/opamp.proto Outdated Show resolved Hide resolved
specification.md Outdated Show resolved Hide resolved
specification.md Show resolved Hide resolved
Copy link

@evan-bradley evan-bradley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for all your work on this.

Copy link
Member

@tigrannajaryan tigrannajaryan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just a couple final nits.

Thanks for the patience @jaronoff97 !

proto/opamp.proto Outdated Show resolved Hide resolved
specification.md Outdated Show resolved Hide resolved
specification.md Outdated Show resolved Hide resolved
@tigrannajaryan
Copy link
Member

@andykellr if you would like to take a look, otherwise I will merge it.

@andykellr
Copy link
Contributor

You can merge. I've been following along and looks great.

Copy link
Member

@tigrannajaryan tigrannajaryan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last thing I forgot to check earlier: the status labels.

Let's use [Development] for now.

proto/opamp.proto Show resolved Hide resolved
proto/opamp.proto Outdated Show resolved Hide resolved
specification.md Outdated Show resolved Hide resolved
specification.md Show resolved Hide resolved
jaronoff97 and others added 2 commits July 29, 2024 16:32
Signed-off-by: Jacob Aronoff <[email protected]>
@tigrannajaryan tigrannajaryan merged commit 58acf6b into open-telemetry:main Jul 29, 2024
5 checks passed
@johnphilip283
Copy link

great work, thank you @jaronoff97!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

OpAMP Agent heartbeats
8 participants