Alarm server Kafka error handling #2265
Merged
When the alarm server gets disconnected from Kafka, it cannot send alarm state updates.
One possible scenario:
Alarm server gets disconnected from Kafka (and clients as well).
Alarm server still has PV connections. One PV goes into alarm and then maybe out again. Alarm server latches the alarm, but the state update to Kafka is not possible (and clients wouldn't see it right now anyway because they are also disconnected).
Once Kafka is back online, it is unaware of those alarms, so newly started clients receive an outdated alarm state. Alarm server still tracks the alarms as active and will even emit the "There are N active alarms" messages every 15 minutes. Clients annunciate those messages but show a different number of active alarms in the UI.
This update offers two additions:
A "resend" command in the alarm server can be invoked at any time to trigger a complete re-send of the state for each item in the alarm tree. It can be used to debug server/client inconsistencies.
More importantly, the alarm server now monitors the Kafka connection. While the connection is lost, there's nothing it can do about that. But once the connection is restored, it performs a "resend" so Kafka and clients get updated to the most recent alarm server state, removing inconsistencies caused by the dropped messages.
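As a rough illustration of what a "resend" amounts to, the sketch below walks a tree and hands the current state of every item to a sender callback. It is not the code from this PR: `AlarmNode`, `Resend` and the `sender` callback are hypothetical stand-ins, and the real server would publish each state to Kafka instead of calling a lambda.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;

/** Hypothetical stand-in for one item of the alarm tree (not the actual server class) */
class AlarmNode
{
    final String path;
    final String state;
    final List<AlarmNode> children = new ArrayList<>();

    AlarmNode(final String path, final String state)
    {
        this.path = path;
        this.state = state;
    }

    /** Add a child, return this node to allow chaining */
    AlarmNode add(final AlarmNode child)
    {
        children.add(child);
        return this;
    }
}

class Resend
{
    /** Send current state of 'node' and all its descendants, parent before children.
     *  @param sender Assumed callback that receives (path, state), e.g. a Kafka producer wrapper
     *  @return Number of items that were sent
     */
    static int resend(final AlarmNode node, final BiConsumer<String, String> sender)
    {
        sender.accept(node.path, node.state);
        int count = 1;
        for (AlarmNode child : node.children)
            count += resend(child, sender);
        return count;
    }
}
```

Because every item is sent regardless of whether it changed, the same walk serves both the manual command and the automatic recovery after a reconnect.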
The Kafka client library offers no direct API to check the connection state, so a periodic "listTopics" call is used as suggested in https://stackoverflow.com/questions/38103198/how-to-check-kafka-consumer-state and other places. The check can be configured or disabled via a new preference setting.
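The monitoring idea can be sketched without any Kafka dependency: probe periodically, remember the last result, and run a callback only on the transition from disconnected to connected. This is a minimal sketch under assumptions, not the implementation in this PR. `ConnectionMonitor`, `probe`, `on_reconnect` and `period_secs` are hypothetical names; in the server the probe would presumably wrap the "listTopics" call with a timeout, and the callback would trigger the "resend".

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/** Sketch of a connection monitor that reacts to the disconnected -> connected transition */
class ConnectionMonitor
{
    private final BooleanSupplier probe;
    private final Runnable on_reconnect;
    /** Assumption: the server starts out connected */
    private boolean connected = true;

    /** @param probe Returns true if the broker responded (hypothetical wrapper for "listTopics")
     *  @param on_reconnect Invoked once each time the connection is restored
     */
    ConnectionMonitor(final BooleanSupplier probe, final Runnable on_reconnect)
    {
        this.probe = probe;
        this.on_reconnect = on_reconnect;
    }

    /** Perform one check, meant to be called periodically
     *  @return true if currently connected
     */
    synchronized boolean check()
    {
        boolean now;
        try
        {
            now = probe.getAsBoolean();
        }
        catch (RuntimeException ex)
        {   // Treat a failing probe as "disconnected"
            now = false;
        }
        final boolean restored = now && !connected;
        connected = now;
        if (restored)
            on_reconnect.run();
        return now;
    }

    /** Schedule periodic checks
     *  @param period_secs Check period in seconds, zero or less disables the monitor
     *  @return Timer that runs the checks, or null when disabled
     */
    static ScheduledExecutorService start(final ConnectionMonitor monitor, final long period_secs)
    {
        if (period_secs <= 0)
            return null;
        final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleWithFixedDelay(monitor::check, period_secs, period_secs, TimeUnit.SECONDS);
        return timer;
    }
}
```

Nothing happens while the probe keeps failing, which matches the point that the server cannot act during the outage. The callback fires once per recovery, not on every successful check.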