
[Task Manager] Kibana discovery service #187696

Closed
8 tasks done
mikecote opened this issue Jul 5, 2024 · 16 comments · Fixed by #187997
Labels: discuss · Feature:Task Manager · Team:Core (Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc) · Team:ResponseOps (Label for the ResponseOps team, formerly the Cases and Alerting teams)

Comments

@mikecote
Contributor

mikecote commented Jul 5, 2024

To support task partitioning, we must make each Kibana node aware of how many nodes are currently running and what their IDs are, so we can consistently determine which Kibana node owns which task partitions.

To accomplish this, I propose creating a new service within Kibana Task Manager that leverages Elasticsearch to determine the number of Kibana nodes running. A rough sketch of what this could look like follows the requirements below.

Requirements

  • A new background_task_node saved-object type
  • ID of the saved object is the Kibana node UUID (the same UUID task manager uses to claim tasks)
  • Fields contain lastSeen (date)
  • Each Kibana node upserts its document's lastSeen date on a 10s interval
  • API to fetch active Kibana nodes. Queries the SO for lastSeen within the last 30s and returns Kibana node UUIDs (document IDs).
  • Implement a deletion strategy for documents that have a lastSeen older than 5 minutes
  • Attempt to delete the document when Kibana receives a shutdown signal
  • SO type is hidden
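
A minimal sketch of how these requirements could fit together (TypeScript). The `NodeStore` interface, class name, and constants below are hypothetical stand-ins rather than Kibana's actual saved-objects API:

```ts
// Hypothetical sketch only; not Kibana's real saved-objects client API.
interface NodeDoc {
  id: string; // Kibana node UUID (same UUID task manager uses to claim tasks)
  lastSeen: string; // ISO date of the node's last heartbeat
}

interface NodeStore {
  upsert(doc: NodeDoc): Promise<void>;
  findSeenSince(date: Date): Promise<NodeDoc[]>;
  deleteSeenBefore(date: Date): Promise<void>;
  delete(id: string): Promise<void>;
}

const HEARTBEAT_INTERVAL_MS = 10_000; // upsert lastSeen every 10s
const ACTIVE_WINDOW_MS = 30_000; // nodes seen within the last 30s count as active
const CLEANUP_AGE_MS = 5 * 60_000; // delete docs with lastSeen older than 5 minutes

class KibanaDiscoveryService {
  private timer?: ReturnType<typeof setInterval>;

  constructor(private readonly nodeId: string, private readonly store: NodeStore) {}

  start() {
    const beat = async () => {
      await this.store.upsert({ id: this.nodeId, lastSeen: new Date().toISOString() });
      // Fallback cleanup for nodes that disappeared without deleting their document.
      await this.store.deleteSeenBefore(new Date(Date.now() - CLEANUP_AGE_MS));
    };
    void beat();
    this.timer = setInterval(beat, HEARTBEAT_INTERVAL_MS);
  }

  /** Returns the UUIDs (document IDs) of Kibana nodes seen within the last 30s. */
  async getActiveNodes(): Promise<string[]> {
    const docs = await this.store.findSeenSince(new Date(Date.now() - ACTIVE_WINDOW_MS));
    return docs.map((doc) => doc.id);
  }

  /** Best-effort deletion of this node's document when Kibana shuts down. */
  async stop() {
    if (this.timer) clearInterval(this.timer);
    await this.store.delete(this.nodeId).catch(() => {});
  }
}
```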
mikecote added the Feature:Task Manager and Team:ResponseOps labels on Jul 5, 2024
@elasticmachine
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

@pmuellr
Member

pmuellr commented Jul 9, 2024

I wonder if we could (ab)use this for other things. For instance, determining how fresh maintenance windows and other persisted data are, which we currently refresh for every rule execution. We could use something like the most recent document date, stored in the service data. When the service is updating its heartbeat, it could also check these saved dates and see if we need to refresh them.
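
A rough sketch of that piggy-backing idea, with hypothetical helper names: while writing the heartbeat, the service compares the most recent maintenance-window date against what it has cached and only triggers a refresh when it changed.

```ts
// Hypothetical sketch: refresh cached data (e.g. maintenance windows) only when
// the most recent document date changed, instead of on every rule execution.
interface ServiceData {
  lastSeen: string;
  maintenanceWindowsUpdatedAt?: string;
}

async function heartbeatWithFreshnessCheck(
  cachedUpdatedAt: string | undefined,
  readLatestUpdate: () => Promise<string>, // most recent maintenance-window date
  refreshMaintenanceWindows: () => Promise<void>
): Promise<ServiceData> {
  const latest = await readLatestUpdate();
  if (latest !== cachedUpdatedAt) {
    await refreshMaintenanceWindows(); // only refresh when something actually changed
  }
  return { lastSeen: new Date().toISOString(), maintenanceWindowsUpdatedAt: latest };
}
```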

@lukeelmers
Member

To be clear, we'd only plan to use this internally in TM, right? (Not shared to other plugins as a service)

Attempt to delete the document when Kibana receives a shutdown signal

What is the impact on task partitioning if this fails? Would it mean some tasks just don't get scheduled until the 5m last_seen is exceeded and the doc is cleaned up?

@mikecote
Contributor Author

mikecote commented Jul 9, 2024

@lukeelmers

To be clear, we'd only plan to use this internally in TM, right? (Not shared to other plugins as a service)

Correct, this is internal to Task Manager only. We won't expose anything from the plugin for others to consume.

What is the impact on task partitioning if this fails, would it mean some tasks just don't get scheduled until the 5m last_seen is exceeded and the doc is cleaned up?

Nothing negative occurs; this is more of a courtesy cleanup, alongside a fallback that periodically cleans old documents so the index doesn't grow indefinitely as Kibana nodes stop running.

@lukeelmers
Member

Thanks for clarifying @mikecote, this makes sense ❤️

My only concern would be if this were exposed for more general-purpose use. But since we are thinking of it as an implementation detail of TM, I'm not too worried about it.

@mikecote
Contributor Author

mikecote commented Jul 9, 2024

The discovery service will be used to assign task partitions to Kibana nodes. Knowing how many nodes are running, we'll ensure that only two nodes share any given partition and adjust the assignments as Kibana nodes appear/disappear. I have more details in this GH issue (#187700), and I'm happy to expand further if you like. A rough illustration of the assignment idea is sketched below.
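
Illustrative only (not the algorithm from #187700): assuming a fixed partition count and a deterministic ordering of node UUIDs, partitions could be spread so that each one is owned by two nodes.

```ts
// Illustrative only: assign each partition to two of the active nodes.
const NUM_PARTITIONS = 256; // assumed fixed partition count

function assignPartitions(activeNodeIds: string[]): Map<string, number[]> {
  const nodes = [...activeNodeIds].sort(); // deterministic order on every node
  const assignments = new Map<string, number[]>();
  for (const id of nodes) assignments.set(id, []);
  if (nodes.length === 0) return assignments;

  const ownersPerPartition = Math.min(2, nodes.length);
  for (let partition = 0; partition < NUM_PARTITIONS; partition++) {
    for (let k = 0; k < ownersPerPartition; k++) {
      const owner = nodes[(partition + k) % nodes.length];
      assignments.get(owner)!.push(partition);
    }
  }
  return assignments;
}
```

Recomputing this from the latest active-node list as nodes appear or disappear is what shifts partition ownership between nodes.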

@kobelb
Contributor

kobelb commented Jul 9, 2024

This will never be accurate because it assumes that clocks are synchronized, which goes against a fundamental principle of distributed systems. Just joking; I wanted to get this argument out of the way. This does require clocks to be loosely synchronized to work correctly; however, given that two nodes are responsible for each partition, there are already some mitigations in place, and we do not need a high degree of precision for this to be usable by task manager. However, it is further reason to keep this internal to task manager, because for other usages this might be a fundamental flaw.

@mikecote
Contributor Author

mikecote commented Jul 9, 2024

Perhaps a better name for the SO type is background_task_node if we want to clarify that it is only for the task manager; this will be hard to change in the future.

@pgayvallet
Contributor

We've been talking about having that kind of "Kibana discovery mechanism" for literally years now. Like, this is one of the first discussions I remember having when I joined 5 years ago.

We've been discussing it a lot. So we sure do know this push/pull system is imperfect, has limitations, won't be as good as a proper discovery system, doesn't (directly/easily) provide things like leader election or such... So yeah, we do know it can only be used for very specific use cases that need to be carefully chosen.

However, that's still way better than what we have today - Nothing. Void. Nada. KibanaA has no idea if they are alone in the universe or if they have friends. And this is quite sad, in a way. KibanaA could do so many amazing things if they just had a rough idea about the approximative number of nodes in their cluster, and access to information related to those friends.

So, all this amazing story telling to say:

I absolutely get why, from a ResponseOps perspective, we would like to keep that internal to TM. It's the safe call. You have your use case in mind, and you don't want to open the Pandora's box of having to support the feature for other potential consumers. And this totally makes sense. From ResponseOps's perspective.

I think the standpoint from Core / Platform services should be different though. Personally, I know we've been waiting for years for a valid use case to finally be able to start working on that discovery system. Now that we have what imho looks like the perfect opportunity, I really think we should be talking about the possibility of having this as a platform service instead of some internal implementation detail of TM.

Those were my 2 cents.

pgayvallet added the discuss and Team:Core labels on Jul 10, 2024
@elasticmachine
Contributor

Pinging @elastic/kibana-core (Team:Core)

@mikecote
Contributor Author

I really think we should be talking about the possibility to have this as a platform service instead of some internal implementation detail to TM.

@pgayvallet would there be interest in moving this service to Core once we figure out how we need it to work for our use case? If so, maybe the Core team can review our approach and let us know what modifications would make it easier to move the service down the line (mappings, SO name, index, etc.). I'm leaning toward this approach rather than having Core build it right away, given that we don't know exactly how it should work yet and we have an immediate need for such a service. Open to ideas.

@rudolf
Contributor

rudolf commented Jul 10, 2024

++ This has come up several times, but usually a simpler, less optimal solution that doesn't require discovery ends up being chosen. I think it would make a lot of sense to have this as part of the platform. But that doesn't mean Core needs to build it, and I think the priority should be for ResponseOps to validate that discovery helps you with partitioning and increases TM throughput.

This was my stab at an algorithm: #93029 (comment). The biggest difference is that instead of relying on timestamps to detect liveness, I try to detect liveness by checking for heartbeats. So the clocks can be out of sync and a node can still be detected as alive.
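
A tiny sketch of that difference (hypothetical shape, not the exact algorithm from #93029): each node bumps its own counter in the shared state, and a peer is considered alive if its counter changed since the previous check, independent of wall-clock time.

```ts
// Hypothetical sketch: liveness by observing heartbeat counter changes rather
// than by comparing timestamps across possibly out-of-sync clocks.
type Heartbeats = Record<string, number>; // nodeId -> monotonically increasing counter

function liveNodes(previous: Heartbeats, current: Heartbeats): string[] {
  return Object.keys(current).filter(
    (nodeId) => previous[nodeId] === undefined || current[nodeId] > previous[nodeId]
  );
}
```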

@pgayvallet
Contributor

@mikecote

would there be interest in moving this service to Core once we figure out how we need it to work for our use case? If so, maybe the Core team can review our approach and let us know what modifications will make it easier to move the service down the line

Yeah, I think it would make sense; that way we avoid blocking you on the initial implementation, but we would make sure we could somewhat easily port the concept to a Core service later.

@lukeelmers
Member

that way we avoid blocking you on that initial implementation

++ I'm not categorically opposed to providing something like this as a core service, but I just want to make sure we are treating it as a separate discussion based on other valid use cases (besides just this one), and not blocking this effort on it. If Core can provide guidance along the way to make this easier to repurpose in the future if we need to, that sounds good to me. Let's just be sure we aren't sinking large amounts of time into R&D or adding too much complexity to the work ResponseOps is doing.

@mikecote
Contributor Author

The draft PR (#187997) is starting to shape up, and if anyone wants to see it in practice, it aligns with the issue description.


Thanks, @rudolf, for sharing your thoughts on such a service. If I understand correctly, your approach works with a single Elasticsearch document, and instead of timestamps, it uses hrtime. Is the core piece that the Kibana nodes compare the state object on an interval and determine which Kibana nodes are still "alive" by checking whether a field value has changed?

I like that the approach doesn't rely on the clocks being synchronized. I'm unsure whether such a pattern could be applied when using a document per Kibana node while still preventing the index from growing indefinitely. Comparing the approaches, my concern about having all the nodes update the same document is the increase in contention as more Kibana nodes are added (think 64 or even 150). Task manager starts seeing contention when multiple nodes try to claim the same tasks more than 5 times per second, and it becomes hard to work around beyond 10 times per second.

@pgayvallet
Contributor

As we agreed, I opened #188177 to discuss what a "Core discovery service" would look like and to identify which features could benefit from it.
