This repository has been archived by the owner on Nov 8, 2022. It is now read-only.

RFC: Tribe Clusters and Worker Pattern #1584

Open
dishmael opened this issue Apr 3, 2017 · 8 comments

Comments

@dishmael
Contributor

dishmael commented Apr 3, 2017

Summary

We often need to gather metrics from a remote system using a centralized collector or, ideally, a cluster (tribe) of snap collectors. The overarching goal is to define a single task that collects one or more metrics from a remote node, submit that task to the tribe, and have the tribe assign it to a worker for collection.

Proposal

At a configurable interval, the collectors would vote on which collector becomes the master and which collectors gather the metric(s) defined in a task shared among the members of a tribe. This could be achieved using, for example, Raft (https://raft.github.io). Busy collectors would naturally be slower to respond, so faster, less-utilized collectors would be selected to gather those metrics. Tribe HA (this RFC) would be configurable as a grouping option, allowing users to define which cluster members operate in an HA model, since not all snap telemetry tribes need to be HA.
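To make the selection step concrete, here is a minimal sketch of ranking collectors by how quickly they respond. It is not Raft and not existing Snap code. The `Collector` type, the `response_ms` field, and the rule that the fastest responder becomes master are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Collector:
    name: str
    response_ms: float  # hypothetical: time taken to answer the latest vote request


def elect(collectors, workers_needed):
    """Rank collectors by response time.

    Assumption: the fastest responder becomes master and the next
    fastest ones are picked as workers. Ties are broken by name.
    """
    if not collectors:
        raise ValueError("a tribe needs at least one collector")
    ranked = sorted(collectors, key=lambda c: (c.response_ms, c.name))
    master, rest = ranked[0], ranked[1:]
    return master, rest[:workers_needed]


tribe = [
    Collector("node-a", 42.0),
    Collector("node-b", 7.5),
    Collector("node-c", 19.0),
    Collector("node-d", 88.0),
]
master, workers = elect(tribe, workers_needed=2)
print(master.name, [w.name for w in workers])
```

A real implementation would get its leader from the consensus protocol itself; the point here is only that slow (busy) collectors fall to the back of the ranking.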

Motivation/Use Cases

The Raft link above gives a decent description of how cluster consensus might work in the Tribe architecture. This RFC targets the following motivations and use cases.

  • Tribe configurable; not all members need HA
  • Tribe membership follows existing paradigm; all members obtain plugins and task definitions
  • Consensus voting amongst tribe cluster members to determine Master
  • Upon election, Master assigns tasking to workers (publish/subscribe model?)
  • Re-election occurs at predefined periods and may be based on snap telemetry daemon utilization
  • Task tracking needs to be considered to ensure all tasks are completed
  • Task execution follows existing paradigm; collection --> processing --> publishing
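The task assignment and task tracking bullets above could look roughly like the following. This is a sketch under stated assumptions, not a design: `TaskTracker`, its method names, and the round-robin assignment are hypothetical, and tasks and workers are plain strings.

```python
class TaskTracker:
    """Hypothetical master-side bookkeeping: every task has at most one owner."""

    def __init__(self, tasks):
        self.pending = list(tasks)  # tasks waiting for a worker
        self.assigned = {}          # task -> worker currently responsible
        self.completed = set()

    def assign(self, workers):
        """Hand out pending tasks round-robin across the given workers."""
        if not workers:
            return
        for i, task in enumerate(self.pending):
            self.assigned[task] = workers[i % len(workers)]
        self.pending = []

    def complete(self, task, worker):
        """Accept a completion only from the task's current owner."""
        if self.assigned.get(task) != worker:
            return False
        del self.assigned[task]
        self.completed.add(task)
        return True

    def worker_lost(self, worker):
        """Re-queue everything the lost worker still owned."""
        orphaned = [t for t, w in self.assigned.items() if w == worker]
        for task in orphaned:
            del self.assigned[task]
        self.pending.extend(orphaned)


tracker = TaskTracker(["cpu", "disk", "net"])
tracker.assign(["w1", "w2"])    # cpu -> w1, disk -> w2, net -> w1
tracker.complete("cpu", "w1")
tracker.worker_lost("w1")       # "net" goes back to pending
tracker.assign(["w2"])          # "net" is now owned by w2
stale = tracker.complete("net", "w1")  # rejected: w1 no longer owns it
tracker.complete("net", "w2")
tracker.complete("disk", "w2")
print(sorted(tracker.completed), stale)
```

Rejecting completions from anyone but the current owner is one way to keep a task from being counted twice after a reassignment.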

Benefits

Using a cluster with a Master/Workers architecture provides high availability without duplicate polling. A task can be defined once, submitted to the tribe, and executed only once, with collection guaranteed to be performed by one of the workers.

Drawbacks

This adds overhead to a Tribe; at a minimum, it increases the amount of cross-chatter between snap telemetry instances.

Definitions

The following definitions are used in this RFC:

  • Master: A Node in a Tribe that has been elected to assign tasks to Workers in the tribe cluster. There can be only a single instance of a Master in a Tribe cluster.
  • Node: An instance of the snapteld daemon (may run one or more on a physical/virtual host).
  • Tribe: A collection of Nodes.
  • Worker: A Node in a Tribe cluster that is not the Master and is designated to execute tasks. A Tribe cluster may have one or more Workers.
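As a rough data model for these definitions, the sketch below encodes the single-Master rule directly. The class and field names are hypothetical, and it assumes every non-Master Node is a Worker, which is a simplification of the definition above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Node:
    """One snapteld instance; several may run on the same host."""
    node_id: str
    host: str


@dataclass
class Tribe:
    nodes: List[Node] = field(default_factory=list)
    master: Optional[Node] = None

    def promote(self, node):
        """Make `node` the Master, replacing any previous one."""
        if node not in self.nodes:
            raise ValueError("only a member of the tribe can become Master")
        self.master = node  # a single field, so there is never more than one

    @property
    def workers(self):
        # Assumption: every Node that is not the Master acts as a Worker.
        return [n for n in self.nodes if n != self.master]


a = Node("a", "host-1")
b = Node("b", "host-1")  # second snapteld instance on the same host
c = Node("c", "host-2")
members = Tribe(nodes=[a, b, c])
members.promote(b)
print(members.master.node_id, [n.node_id for n in members.workers])
```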

Issues Addressed

The following issues would be satisfied by implementing this RFC:

@candysmurf
Contributor

@dishmael, thanks for your RFC. What you proposed here is more like Raft or ZooKeeper, which would be great if there is a need to coordinate across clusters/tribes. It's definitely a good direction to go.

I think #773 is low-hanging fruit. Would #773 help with your use cases?

@snapbot snapbot added the tracked label Apr 4, 2017
@dishmael
Contributor Author

dishmael commented Apr 4, 2017

@candysmurf this RFC would satisfy the need of #773 (HA) and #1558 (No Duplicate Polling).

@jtlisi
Contributor

jtlisi commented Apr 4, 2017

I feel I have something to add to this. I think the idea of distributing a task across a tribe is a great one in principle. I would really like to see ideas on how this would be implemented, since I have some particular use cases in mind.

The primary use case I had in mind was service discovery and collection of metrics from container-bound applications. For example, suppose you had a pod in Kubernetes running a group of containers that each host a */metrics endpoint with application metrics. I would want to use this feature to dynamically schedule the collection of metrics from these endpoints.

In the above use case, sharing a task is useful for accomplishing this. However, the feature seems incomplete without some associated form of service discovery. Snap needs a way to schedule and un-schedule shared tasks based on contextual data obtained through service discovery, similar to how Prometheus would collect from a Kubernetes cluster.

This feature doesn't necessarily have to be integrated directly into snap. It could be done using an external scheduling daemon that exists outside of snap and interacts with it using the Snap REST API. Or it could be implemented directly as a new type of plugin designed for shared tasks that can pass configuration forward to a set of collectors.
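As a sketch of what the external daemon option might do on each pass, the function below compares the endpoints found by service discovery with the tasks already scheduled and works out what to create and what to remove. The function name, the URLs, and the task IDs are made up for illustration, and the actual calls to Snap to create or remove tasks are deliberately left out.

```python
def reconcile(discovered, scheduled):
    """Return (to_create, to_remove) so scheduled tasks match discovery.

    discovered: set of endpoint URLs found by service discovery
    scheduled:  dict mapping endpoint URL -> ID of the task collecting it
    """
    # Endpoints that exist but have no task yet.
    to_create = sorted(discovered - set(scheduled))
    # Tasks whose endpoint has disappeared.
    to_remove = sorted(
        task_id for url, task_id in scheduled.items() if url not in discovered
    )
    return to_create, to_remove


# Hypothetical state: one endpoint appeared, one went away.
discovered = {
    "http://10.0.0.5:8080/metrics",
    "http://10.0.0.6:8080/metrics",
}
scheduled = {
    "http://10.0.0.5:8080/metrics": "task-1",
    "http://10.0.0.9:8080/metrics": "task-2",
}
to_create, to_remove = reconcile(discovered, scheduled)
print(to_create, to_remove)
```

Running this comparison periodically, rather than reacting to individual events, means a missed notification is corrected on the next pass.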

Let me know what you guys think of this idea. It's something I feel would be really useful in a container-based deployment.

@candysmurf
Contributor

candysmurf commented Apr 4, 2017

I have to agree with @jtlisi that something may be achievable outside Snap. @dishmael, would you please add more of your thoughts on how this would work with container replicas?

@jcooklin
Collaborator

jcooklin commented Apr 5, 2017 via email

@andrzej-k
Contributor

So it seems that before we can implement this RFC we need to separate tribe from the main Snap repo. Is that right, @jcooklin?

@jtlisi If you'd like to monitor applications in Kubernetes, you could also think about creating a Snap Third Party Resource (TPR) that associates an application (and its metrics endpoint) with Snap (a task manifest). Then you would need a watcher on the Kubernetes API that tells you when a new application pod is started and checks whether the corresponding Snap TPR is running. With this information, all you would need is some automation to load plugins and tasks. We will be implementing such a solution in the future. The work will be done under: https://github.com/intelsdi-x/snap-integration-kubernetes
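A minimal sketch of the decision step such a watcher might make is shown below. The event dictionaries are a simplified, hypothetical stand-in for what a Kubernetes watch delivers (only the `ADDED` and `DELETED` event types are taken from Kubernetes). The `app` key, the `tpr_apps` set, and the unload-on-delete behavior are assumptions, and no Kubernetes or Snap calls are made.

```python
def plan_actions(events, tpr_apps):
    """Turn pod watch events into load/unload actions.

    events:   iterable of dicts like {"type": "ADDED", "app": "web"}
    tpr_apps: set of application names that have a Snap TPR
    """
    actions = []
    for event in events:
        app = event["app"]
        if app not in tpr_apps:
            continue  # no Snap TPR is associated with this application
        if event["type"] == "ADDED":
            actions.append(("load", app))
        elif event["type"] == "DELETED":
            # Assumption: tasks are unloaded when the pod goes away.
            actions.append(("unload", app))
    return actions


events = [
    {"type": "ADDED", "app": "web"},
    {"type": "ADDED", "app": "batch"},    # has no Snap TPR, so it is ignored
    {"type": "MODIFIED", "app": "web"},   # no action in this sketch
    {"type": "DELETED", "app": "web"},
]
actions = plan_actions(events, tpr_apps={"web"})
print(actions)
```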

@candysmurf
Contributor

@andrzej-k, do you have statistics on how many of our customers are using tribe?

@jcooklin
Collaborator

jcooklin commented Apr 6, 2017 via email

6 participants