[Task Manager] Kibana discovery service #187696
Comments
Pinging @elastic/response-ops (Team:ResponseOps) |
I wonder if we could (ab)use this for other things, for instance determining how fresh maintenance windows and other persisted data are, which we currently refresh on every rule execution. We could store something like the most recent document date in the service data. When the service updates its heartbeat, it could also check these saved dates to see whether we need to refresh them. |
To be clear, we'd only plan to use this internally in TM, right? (Not shared to other plugins as a service)
What is the impact on task partitioning if this fails, would it mean some tasks just don't get scheduled until the 5m |
Correct, this is internal to Task Manager only. We won't expose anything by the plugin for others to consume.
Nothing negative occurs; this is more of a courtesy cleanup alongside a fallback that periodically cleans old documents so the index doesn't indefinitely grow as Kibana nodes stop running. |
Thanks for clarifying @mikecote, this makes sense ❤️ My only concern would be if this were exposed for more general purpose use. But considering we are thinking of it as an implementation detail of TM, then I'm not too worried about it. |
The discovery service will be used to assign task partitions to Kibana nodes. Knowing how many nodes are running, we'll ensure that only two nodes share any partition and adjust as Kibana nodes appear/disappear. I have more details on this GH issue (#187700), and I'm happy to expand further if you like. |
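To make the "only two nodes share any partition" idea concrete, here is a minimal sketch. It is not the Task Manager implementation: the `assignPartitions` function, the round-robin scheme, and the partition count used in the demo are all assumptions made for illustration. The only property it tries to show is that every node, given the same list of active node IDs, computes the same assignment.

```typescript
// Hypothetical helper: gives each partition up to two owners, chosen round-robin
// over the sorted node IDs so every node derives the same result independently.
function assignPartitions(
  nodeIds: string[],
  partitionCount: number
): Map<string, number[]> {
  const sorted = nodeIds.slice().sort(); // identical ordering on every node
  const assignment = new Map<string, number[]>();
  sorted.forEach((id) => assignment.set(id, []));
  if (sorted.length === 0) return assignment;

  // With a single node there is nobody to share with, so it owns everything alone.
  const ownersPerPartition = Math.min(2, sorted.length);
  for (let p = 0; p < partitionCount; p++) {
    for (let k = 0; k < ownersPerPartition; k++) {
      const owner = sorted[(p + k) % sorted.length];
      const owned = assignment.get(owner);
      if (owned) owned.push(p);
    }
  }
  return assignment;
}

// Demo with three nodes and four partitions (both numbers are arbitrary).
const demoAssignment = assignPartitions(["node-b", "node-a", "node-c"], 4);
```

When a node appears or disappears, rerunning the function with the new list of active IDs yields the adjusted assignment. A real implementation would probably also want to limit how many partitions move on each change, which this sketch does not attempt.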
This will never be accurate because it assumes that clocks are synchronized, which is a fundamental principle of distributed systems. Just joking and wanted to get this argument out of the way. This does require clocks to be loosely synchronized to work correctly; however, given the mitigation of two nodes being responsible for each partition, there are already some mitigations in place and we do not need a high degree of precision for this to be utilized by task-manager. However, it is further reasoning to keep this internal to task-manager because for other usages this might be a fundamental flaw. |
Perhaps a better name for the SO type is |
We've been talking about having that kind of "Kibana discovery mechanism" for literally years now. This is one of the first discussions I remember having when I joined 5 years ago, and we've discussed it a lot since. So we do know this push/pull system is imperfect, has limitations, won't be as good as a proper discovery system, and doesn't (directly/easily) provide things like leader election. So yes, we know it can only be used for very specific use cases that need to be carefully chosen.
However, that's still way better than what we have today: nothing. Void. Nada. KibanaA has no idea if they are alone in the universe or if they have friends. And this is quite sad, in a way. KibanaA could do so many amazing things if they just had a rough idea of the approximate number of nodes in their cluster, and access to information about those friends.
So, all this amazing storytelling to say: I absolutely get why, from a ResponseOps perspective, we would like to keep that internal to TM. It's the safe call. You have your use case in mind, and you don't want to open the Pandora's box of having to support the feature for other potential consumers. And this totally makes sense from ResponseOps's perspective.
I think the standpoint from Core / Platform services should be different though. Personally, I know we've been waiting for years for a valid use case to finally start working on that discovery system. Now that we have what IMHO looks like the perfect opportunity, I really think we should be talking about the possibility of having this as a platform service instead of an internal implementation detail of TM. Those were my 2 cents. |
Pinging @elastic/kibana-core (Team:Core) |
@pgayvallet would there be interest in moving this service to Core once we figure out how we need it to work for our use case? If so, maybe the Core team can review our approach and let us know what modifications would make it easier to move the service down the line (mappings, SO name, index, etc). I'm leaning toward this approach rather than having Core build it right away, given that we don't know exactly how it should work and that we have an immediate need for such a service. Open to ideas. |
++ This has come up several times, but usually a simpler, less optimal solution that doesn't require discovery ends up being chosen. I think it would make a lot of sense to have this as part of the platform. But this doesn't mean that Core needs to build it, and I think the priority should be that ResponseOps validates that discovery helps with partitioning and increases TM throughput. This was my stab at an algorithm: #93029 (comment). The biggest difference is that instead of relying on timestamps to detect liveness, I try to detect liveness by checking for heartbeats. So the clocks can be out of sync but a node can still be detected as alive. |
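For readers who want a feel for liveness detection that does not compare timestamps across machines, here is a rough sketch. It is not the algorithm from the linked comment, only one possible illustration of the general idea: each node bumps a counter as its heartbeat, and an observer treats a peer as dead once that counter has stopped changing for several of the observer's own polls. The `LivenessObserver` class and the `MAX_UNCHANGED_POLLS` threshold are hypothetical names and values.

```typescript
// What the observer remembers about a peer: the last counter value it read and
// how many consecutive polls have gone by without that value changing.
interface PeerObservation {
  counter: number;
  unchangedPolls: number;
}

// Hypothetical threshold: a peer whose counter is unchanged for 3 polls is presumed dead.
const MAX_UNCHANGED_POLLS = 3;

class LivenessObserver {
  private peers = new Map<string, PeerObservation>();

  // Feed in the heartbeat counters read from the shared store on each poll.
  poll(counters: Map<string, number>): void {
    // Forget peers whose entry no longer exists in the store.
    this.peers.forEach((_obs, nodeId) => {
      if (!counters.has(nodeId)) this.peers.delete(nodeId);
    });
    counters.forEach((counter, nodeId) => {
      const prev = this.peers.get(nodeId);
      if (prev === undefined || prev.counter !== counter) {
        this.peers.set(nodeId, { counter, unchangedPolls: 0 }); // progress observed
      } else {
        prev.unchangedPolls += 1; // no progress since the previous poll
      }
    });
  }

  aliveNodeIds(): string[] {
    const alive: string[] = [];
    this.peers.forEach((obs, nodeId) => {
      if (obs.unchangedPolls < MAX_UNCHANGED_POLLS) alive.push(nodeId);
    });
    return alive.sort();
  }
}

// Demo: node-a keeps incrementing its counter, node-b stalls at 1.
const observer = new LivenessObserver();
observer.poll(new Map([["node-a", 1], ["node-b", 1]]));
observer.poll(new Map([["node-a", 2], ["node-b", 1]]));
observer.poll(new Map([["node-a", 3], ["node-b", 1]]));
const aliveAfterThreePolls = observer.aliveNodeIds();
observer.poll(new Map([["node-a", 4], ["node-b", 1]]));
const aliveAfterFourPolls = observer.aliveNodeIds();
```

Only the observer's local polling cadence matters here, so clock skew between nodes does not affect the result. The trade-off is that each observer has to keep per-peer state between polls.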
Yeah I think it would make sense, that way we avoid blocking you on that initial implementation, but we would make sure that we could somewhat easily port the concept to a Core service later. |
++ I'm not categorically opposed to providing something like this as a Core service, but I just want to make sure we are treating it as a separate discussion based on other valid use cases (besides just this one), and not blocking this effort on it. If Core can provide guidance along the way to make this easier to repurpose in the future if we need to, that sounds good to me. Let's just be sure we aren't sinking large amounts of time into R&D or adding too much complexity to the work ResponseOps is doing. |
The draft PR (#187997) is starting to shape up and aligns with the issue description, if anyone wants to see it in practice. Thanks, @rudolf, for sharing your thoughts on such a service. If I understand correctly, your approach works with a single Elasticsearch document, and instead of timestamps, it uses
I like that the approach doesn't rely on the clocks being synchronized. I'm unsure whether such a pattern could be applied when using a document per Kibana node while still preventing the index from growing indefinitely.
Comparing the approaches, my concern about having all the nodes update the same document is the increase in contention as more Kibana nodes are added (think 64 or even 150). The task manager starts seeing contention when multiple nodes try to claim the same tasks more than 5 times per second, and it becomes hard to work around after 10 times per second. |
As we agreed, I opened #188177 to discuss what a "Core discovery service" would look like and identify which features could benefit from it. |
To support task partitioning, we must make the Kibana nodes aware of how many nodes are currently running and what IDs they have, so that every node consistently determines which Kibana node owns which task partitions.
To accomplish this, I propose creating a new service within Kibana Task Manager that leverages Elasticsearch to determine the number of Kibana nodes running.
Requirements
- A new background_task_node saved-object type
- A lastSeen (date) field
- Each Kibana node updates its lastSeen date on a 10s interval
- A query that finds documents with lastSeen within the last 30s and returns Kibana node UUIDs (document IDs)
- Cleanup of documents with lastSeen older than 5 minutes
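
The requirements above can be sketched roughly as follows. This is an illustration under stated assumptions, not the implementation from the draft PR: an in-memory `Map` stands in for the Elasticsearch-backed saved objects, and the function and constant names are hypothetical. Only the 10s, 30s, and 5 minute values come from the requirements.

```typescript
// In-memory stand-in (assumption) for the background_task_node documents.
// Keys are Kibana node UUIDs (document IDs); values are lastSeen in epoch milliseconds.
type NodeStore = Map<string, number>;

const HEARTBEAT_INTERVAL_MS = 10_000; // lastSeen is updated on a 10s interval
const ACTIVE_WINDOW_MS = 30_000; // a node counts as active if seen within the last 30s
const CLEANUP_AFTER_MS = 5 * 60_000; // documents older than 5 minutes are removed

// Upsert this node's document with a fresh lastSeen value.
function recordHeartbeat(store: NodeStore, nodeId: string, now: number): void {
  store.set(nodeId, now);
}

// Return the UUIDs of nodes whose lastSeen falls within the active window.
function getActiveNodeIds(store: NodeStore, now: number): string[] {
  const active: string[] = [];
  store.forEach((lastSeen, nodeId) => {
    if (now - lastSeen <= ACTIVE_WINDOW_MS) active.push(nodeId);
  });
  return active.sort();
}

// Delete documents whose lastSeen is older than the cleanup threshold.
// Returns the IDs that were removed.
function removeStaleNodes(store: NodeStore, now: number): string[] {
  const removed: string[] = [];
  store.forEach((lastSeen, nodeId) => {
    if (now - lastSeen > CLEANUP_AFTER_MS) removed.push(nodeId);
  });
  removed.forEach((nodeId) => store.delete(nodeId));
  return removed.sort();
}

// Demo with simulated time: node-a heartbeats every 10s for a minute,
// node-b stops after its first heartbeat.
const demoStore: NodeStore = new Map();
recordHeartbeat(demoStore, "node-a", 0);
recordHeartbeat(demoStore, "node-b", 0);
for (let t = HEARTBEAT_INTERVAL_MS; t <= 60_000; t += HEARTBEAT_INTERVAL_MS) {
  recordHeartbeat(demoStore, "node-a", t);
}
const activeAtOneMinute = getActiveNodeIds(demoStore, 60_000);
```

Passing `now` in explicitly keeps the sketch testable. In a real deployment each node would use its own clock, which is why the approach needs loosely synchronized clocks, as discussed in the comments.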