Alarm server Kafka error handling #2265

Merged
merged 4 commits into from
May 20, 2022


kasemir commented May 19, 2022

When the alarm server gets disconnected from Kafka, it cannot send alarm state updates.
One possible scenario:
1. The alarm server gets disconnected from Kafka (as do the clients).
2. The alarm server still has PV connections; one PV goes into alarm and then maybe out again. The alarm server latches the alarm, but it cannot send the state update to Kafka (and clients wouldn't see it right now anyway because they are also disconnected).
3. Once Kafka comes back online, it is unaware of those alarms, so newly started clients receive an outdated alarm state. The alarm server still tracks the alarms as active and will even emit the "There are N active alarms" messages every 15 minutes; clients annunciate these but show a different number of active alarms in the UI.

This update offers two additions:

A "resend" command in the alarm server can be invoked at any time to trigger a complete re-send of the state for each item in the alarm tree. It can be used to debug server/client inconsistencies.

More importantly, the alarm server monitors the Kafka connection. If the connection is lost, there's nothing the server can do about that. But once it's restored, the server performs a "resend" so Kafka and clients get updated to the most recent alarm server state, removing the inconsistencies caused by the dropped messages.
The Kafka client library offers no direct API to check the connection state, so a periodic "listTopics" call is used as suggested in https://stackoverflow.com/questions/38103198/how-to-check-kafka-consumer-state and other places. The connection check can be configured and disabled via a new preference setting.
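
To illustrate the two additions, here is a minimal, self-contained sketch of the idea, not the code from this pull request. `DemoItem` and `DemoConnectionMonitor` are hypothetical names. The `probe` stands in for whatever connection test is used (for example a "listTopics" call with a timeout), and starting out in the "connected" state is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;
import java.util.function.BooleanSupplier;

/** Hypothetical, simplified alarm tree item; not the actual alarm server class */
class DemoItem
{
    final String path;
    final String state;
    final List<DemoItem> children = new ArrayList<>();

    DemoItem(final String path, final String state)
    {
        this.path = path;
        this.state = state;
    }

    /** Add a child, returning this item so calls can be chained */
    DemoItem add(final DemoItem child)
    {
        children.add(child);
        return this;
    }

    /** "resend": pass the current state of this item and of all items below it to the sender
     *  @return Number of items that were sent
     */
    int resend(final BiConsumer<String, String> sender)
    {
        sender.accept(path, state);
        int count = 1;
        for (DemoItem child : children)
            count += child.resend(sender);
        return count;
    }
}

/** Hypothetical monitor that runs an action when the connection returns */
class DemoConnectionMonitor
{
    private final BooleanSupplier probe;
    private final Runnable on_reconnect;
    // Assumption of this sketch: consider the connection 'up' until a check fails
    private boolean connected = true;

    DemoConnectionMonitor(final BooleanSupplier probe, final Runnable on_reconnect)
    {
        this.probe = probe;
        this.on_reconnect = on_reconnect;
    }

    /** Perform one check; meant to be called periodically */
    void check()
    {
        boolean ok;
        try
        {
            ok = probe.getAsBoolean();
        }
        catch (RuntimeException ex)
        {   // A probe that fails with an exception counts as "disconnected"
            ok = false;
        }
        // Only the transition from disconnected to connected triggers the action
        if (ok && !connected)
            on_reconnect.run();
        connected = ok;
    }
}
```

In such a sketch, `check()` could be scheduled with a `ScheduledExecutorService`, using a period of zero or less to mean "disabled", and `on_reconnect` would call `resend` on the root of the alarm tree. Reacting only to the disconnected-to-connected transition avoids flooding Kafka with a full resend on every successful check.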

kasemir added 4 commits May 19, 2022 10:59
It allowed entering "\nAnother   \n    Component" as name for new
component. Now that would become "Another Component", and the hint at
the bottom of the dialog updates more often.
If alarm server cannot send state updates because Kafka is down, clients
will miss these updates and server/client are out of lockstep.
This allows manual resending for tests or manual recovery.

kasemir commented May 20, 2022

Fixes #2267

@kasemir kasemir merged commit 8bbc121 into master May 20, 2022
@shroffk shroffk deleted the alarmserver_kafkaerrors branch December 16, 2022 15:45