I opened this thread to discuss how we can improve our alertmanager and ruler storage backends. Currently, those backends suffer from high latency at scale. This is caused by the following factors:
Possible Solutions:
Possible coordinating mechanisms for managing an index:
Ring (Leader-Election): The simplest form of coordination would be to instantiate a Ring with a replication factor of 1. We would then hard-code a token value, and the API instance in the ring that owns that token is the leader. All HTTP write requests are forwarded to the leader, so there is a single writer to object storage and no concurrent writes to the same key (specifically the index). With gossip as the ring backend, this will require significant tuning to ensure the ring is stable before writes are allowed.
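To make the ring idea concrete, here is a minimal sketch of picking the leader from a fixed token. It is a simplified stand-in, not the actual Cortex ring or lifecycler code: `tokenDesc`, `ownerOf`, `route`, the instance names, and the `leaderToken` value are all hypothetical, and ownership is assumed to mean "first token at or after the key, wrapping around".

```go
package main

import (
	"fmt"
	"sort"
)

// leaderToken is the hard-coded token whose owner acts as leader
// (hypothetical value chosen for this sketch).
const leaderToken uint32 = 0

// tokenDesc pairs a ring token with the instance that registered it.
type tokenDesc struct {
	Token    uint32
	Instance string
}

// ownerOf returns the instance that owns key: the instance holding the
// first token >= key, wrapping around to the smallest token.
func ownerOf(ring []tokenDesc, key uint32) (string, bool) {
	if len(ring) == 0 {
		return "", false
	}
	// Sort a copy so the caller's slice is left untouched.
	sorted := append([]tokenDesc(nil), ring...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Token < sorted[j].Token })
	i := sort.Search(len(sorted), func(i int) bool { return sorted[i].Token >= key })
	if i == len(sorted) {
		i = 0 // wrap around the ring
	}
	return sorted[i].Instance, true
}

// route decides whether self handles a write locally or forwards it.
func route(ring []tokenDesc, self string) string {
	leader, ok := ownerOf(ring, leaderToken)
	switch {
	case !ok:
		return "reject: ring is empty"
	case leader == self:
		return "handle locally"
	default:
		return "forward to " + leader
	}
}

func main() {
	ring := []tokenDesc{{900, "ruler-b"}, {150, "ruler-a"}, {4000, "ruler-c"}}
	fmt.Println(route(ring, "ruler-a"))
	fmt.Println(route(ring, "ruler-b"))
}
```

With this ring, `ruler-a` holds the lowest token and therefore owns token 0, so the program prints `handle locally` and then `forward to ruler-a`. The sketch also shows why ring stability matters: if two instances see different rings, they can disagree on who the leader is.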
Exposing CAS primitives for Object Storage: GCS and Azure support conditional requests that would allow an index to be maintained and protected from concurrent writes. However, S3 still does not. If we support an additional Locker for S3 (DynamoDB, see Terraform for an example) and use the native implementation for the other backends, we could implement this using the object storage backends alone.
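As an illustration of the compare-and-swap approach, the sketch below runs a read-modify-write loop against an in-memory bucket that rejects writes when the object has changed since it was read. The bucket is a stand-in modeled loosely on generation-style preconditions; `casBucket`, `PutIfGenerationMatch`, `appendToIndex`, and the `rules/index` key are hypothetical names, not a real storage client API.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errPreconditionFailed = errors.New("precondition failed: object changed since it was read")

// object is a stored blob plus a generation that increases on every write.
type object struct {
	data       []byte
	generation int64
}

// casBucket is an in-memory stand-in for a bucket with conditional writes.
type casBucket struct {
	mu      sync.Mutex
	objects map[string]object
}

func newCASBucket() *casBucket { return &casBucket{objects: map[string]object{}} }

// Get returns the data and generation; generation 0 means "does not exist".
func (b *casBucket) Get(key string) ([]byte, int64) {
	b.mu.Lock()
	defer b.mu.Unlock()
	o := b.objects[key]
	return o.data, o.generation
}

// PutIfGenerationMatch writes only if the current generation equals expected.
func (b *casBucket) PutIfGenerationMatch(key string, data []byte, expected int64) (int64, error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	cur := b.objects[key].generation
	if cur != expected {
		return cur, errPreconditionFailed
	}
	b.objects[key] = object{data: data, generation: cur + 1}
	return cur + 1, nil
}

// appendToIndex retries read-modify-write until the conditional write succeeds.
func appendToIndex(b *casBucket, key, entry string, maxAttempts int) error {
	for attempt := 0; attempt < maxAttempts; attempt++ {
		data, gen := b.Get(key)
		// Copy before appending so the stored slice is never mutated.
		updated := append(append([]byte(nil), data...), []byte(entry+"\n")...)
		_, err := b.PutIfGenerationMatch(key, updated, gen)
		if err == nil {
			return nil
		}
		if !errors.Is(err, errPreconditionFailed) {
			return err
		}
		// Lost the race: re-read the index and try again.
	}
	return fmt.Errorf("gave up after %d attempts", maxAttempts)
}

func main() {
	b := newCASBucket()
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			_ = appendToIndex(b, "rules/index", fmt.Sprintf("tenant-%d", i), 100)
		}(i)
	}
	wg.Wait()
	data, gen := b.Get("rules/index")
	fmt.Printf("generation=%d\n%s", gen, data)
}
```

All four concurrent writers end up in the index and the generation ends at 4, although the order of the entries depends on scheduling. A backend without conditional writes would need the external Locker to get the same guarantee.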
KV Leader Election: Instead of reusing the ring lifecycler code for leader election, we could implement a proper leader election algorithm on top of the KV interface. This would make the election process more efficient and ideally expose more relevant and sane configuration options for tuning.
Log Device (Virtual Consensus): Each rules API instance could maintain a log of the events it has handled and persist this log to S3. This requires a unique ID for each rules API instance. Each log is only mutable by the instance that created it. Replaying these logs allows every instance to build an index, in real time, of the state of the rules store backend. This is a bit of a TLDR, but the idea was taken from this paper.
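A rough sketch of the replay step might look like the following. It is not the algorithm from the paper: the `event` fields, the `buildIndex` helper, the key layout, and in particular the ordering rule (a timestamp, then instance ID, then per-log sequence number) are assumptions made for illustration, and how events from different logs should really be ordered is an open design question.

```go
package main

import (
	"fmt"
	"sort"
)

type op int

const (
	opPut op = iota
	opDelete
)

// event is one entry in an instance's append-only log (hypothetical schema).
type event struct {
	Instance string // unique ID of the rules API instance that wrote the entry
	Seq      int64  // position within that instance's log
	Time     int64  // assumed logical timestamp used to order events across logs
	Op       op
	Key      string // e.g. tenant/namespace/group
}

// buildIndex replays every per-instance log in a deterministic order and
// returns, for each live key, the instance that last wrote it.
func buildIndex(logs map[string][]event) map[string]string {
	var all []event
	for _, l := range logs {
		all = append(all, l...)
	}
	// Order by (Time, Instance, Seq) so every replica derives the same index.
	sort.Slice(all, func(i, j int) bool {
		a, b := all[i], all[j]
		if a.Time != b.Time {
			return a.Time < b.Time
		}
		if a.Instance != b.Instance {
			return a.Instance < b.Instance
		}
		return a.Seq < b.Seq
	})
	index := map[string]string{}
	for _, e := range all {
		switch e.Op {
		case opPut:
			index[e.Key] = e.Instance
		case opDelete:
			delete(index, e.Key)
		}
	}
	return index
}

func main() {
	logs := map[string][]event{
		"ruler-a": {
			{Instance: "ruler-a", Seq: 1, Time: 1, Op: opPut, Key: "tenant-1/ns/group-1"},
			{Instance: "ruler-a", Seq: 2, Time: 4, Op: opDelete, Key: "tenant-1/ns/group-2"},
		},
		"ruler-b": {
			{Instance: "ruler-b", Seq: 1, Time: 2, Op: opPut, Key: "tenant-1/ns/group-2"},
			{Instance: "ruler-b", Seq: 2, Time: 3, Op: opPut, Key: "tenant-1/ns/group-1"},
		},
	}
	for key, writer := range buildIndex(logs) {
		fmt.Printf("%s last written by %s\n", key, writer)
	}
}
```

Because each instance only ever appends to its own log, no two writers touch the same object, which sidesteps the missing CAS support on S3. The cost moves to the read side, where every instance has to fetch and merge all logs.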
Existing Questions: