Implement leader election for the target allocator #1061

Closed
jaronoff97 opened this issue Aug 25, 2022 · 4 comments · Fixed by #1087
Labels
area:target-allocator Issues for target-allocator

Comments

@jaronoff97
Contributor

jaronoff97 commented Aug 25, 2022

Right now, the target allocator's allocation strategy (least connection) means that you can only run a single TA pod at a time. This is a problem for any consumer who wants a high-availability option for target allocation. To make that possible, we could use the built-in Go leader election package for the target allocator.

The rough process a collector follows right now looks like this:

  1. collector starts up
  2. prom receiver loads configuration
  3. for each job
  4. http_sd_config queries target allocator
  5. list of targets and metadata is returned from TA
  6. prom receiver runs relabel_configs on targets
  7. prom receiver scrapes targets remaining
  8. prom receiver applies metric_relabel_configs
  9. collector converts prom to otel
  10. moves config to processor stage ...

If the target allocator is down at step 4, the job will fail (and most likely the entire scrape config). Adding support for HA would improve the reliability of the statefulset collector.
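
To make the failure mode at step 4 concrete, here is a minimal sketch of a client querying a target allocator over HTTP service discovery and surfacing an error when it is unreachable. The response shape (a JSON list of objects with `targets` and `labels`) is the Prometheus HTTP SD format; the names `TargetGroup`, `parseTargetGroups`, and `fetchTargets`, the sample address, and the use of the server root as the endpoint are assumptions for illustration, not the collector's actual code.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// TargetGroup mirrors one entry of a Prometheus HTTP SD response.
type TargetGroup struct {
	Targets []string          `json:"targets"`
	Labels  map[string]string `json:"labels"`
}

// parseTargetGroups decodes an HTTP SD response body (hypothetical helper).
func parseTargetGroups(body []byte) ([]TargetGroup, error) {
	var groups []TargetGroup
	if err := json.Unmarshal(body, &groups); err != nil {
		return nil, fmt.Errorf("decode target groups: %w", err)
	}
	return groups, nil
}

// fetchTargets queries the allocator once. Any transport error or non-200
// status is returned to the caller, which is the point where a scrape job
// would fail if no allocator instance is serving.
func fetchTargets(client *http.Client, url string) ([]TargetGroup, error) {
	resp, err := client.Get(url)
	if err != nil {
		return nil, fmt.Errorf("query target allocator: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("target allocator returned %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, fmt.Errorf("read response: %w", err)
	}
	return parseTargetGroups(body)
}

func main() {
	// A local test server stands in for a single TA pod.
	ta := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		fmt.Fprint(w, `[{"targets":["10.0.0.1:9100"],"labels":{"job":"node"}}]`)
	}))
	client := &http.Client{Timeout: 2 * time.Second}

	groups, err := fetchTargets(client, ta.URL)
	fmt.Println("while TA is up:", groups, err)

	// Shut the only TA instance down; the same query now returns an error.
	ta.Close()
	_, err = fetchTargets(client, ta.URL)
	fmt.Println("after TA goes down:", err)
}
```

With a single TA pod there is nothing else to answer the second query, which is the gap an HA setup would close.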

If this is something the community would like, I would be happy to implement and test it.

@Aneurysm9
Member

Interesting suggestion. Improving the resiliency of the target allocation layer would definitely be a plus. Would the followers use the existing API to obtain state from the leader, or would you expect an active-passive setup with failover?

@secustor
Member

secustor commented Aug 26, 2022

Yes, some kind of HA is definitely needed for TA.

Regarding leader election and state sharing, I'll throw memberlist into the ring. Grafana uses it in all its distributed products (Mimir, Loki, Tempo, ...), so it seems stable.

That way we can simply point the collector to the TA using a service, as we do now, and don't need to think about manual failovers.

@jaronoff97
Contributor Author

@Aneurysm9 I was expecting an active-passive setup with failover so as not to complicate any of the existing logic. @secustor using memberlist, would we instead do state sharing, or would we just use it to determine which instance is active?

@jeromeinsf

@secustor memberlist might need to be tuned so that its propagation delay meets HA expectations.

5 participants