[Task Manager] Assign task partitions to Kibana nodes #187700

Closed
mikecote opened this issue Jul 5, 2024 · 2 comments · Fixed by #188758
Labels
Feature:Task Manager, Team:ResponseOps (label for the ResponseOps team, formerly the Cases and Alerting teams)

Comments

@mikecote
Contributor

mikecote commented Jul 5, 2024

Depends on #187696
Depends on #187698

Once Kibana is aware of the running instances and we have task partitioning in place, we should assign a subset of the partitions to each Kibana node so only two Kibana nodes fight for the same tasks.

Requirements

  • Assigned task partitions logic is calculated every 10 seconds on each Kibana node
  • Task claiming logic is updated to filter for the subset of task partitions or tasks with missing partition values
  • Partitions are assigned in a round-robin manner, as in the PoC, so the map is the same on every Kibana node (see the sketch after this list)
  • Only applies when mget task claiming strategy is used
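
As a rough illustration of the round-robin idea above (not the actual Kibana implementation; the function name, constants, and the use of a sorted node list are assumptions), each of the 256 partitions could be dealt out to two nodes by cycling through the list of discovered node names. Because every node computes the map from the same inputs, the result is identical everywhere:

```
// Illustrative sketch only; names and constants are assumptions.
const ALL_PARTITIONS = 256; // the expected map in the PR description covers partitions 0-255
const NODES_PER_PARTITION = 2; // so only two nodes fight over any one partition

// Deal partitions out round-robin and return the subset owned by `thisNode`.
function assignPartitions(sortedNodeNames: string[], thisNode: string): number[] {
  const myPartitions: number[] = [];
  let cursor = 0;
  for (let partition = 0; partition < ALL_PARTITIONS; partition++) {
    for (let slot = 0; slot < NODES_PER_PARTITION; slot++) {
      const owner = sortedNodeNames[cursor % sortedNodeNames.length];
      cursor++;
      if (owner === thisNode) {
        myPartitions.push(partition);
      }
    }
  }
  return myPartitions;
}

// With five nodes and this node first in the list, this yields
// 0, 2, 5, 7, 10, 12, ... 255 — the same shape as the expected map in the PR description below.
console.log(assignPartitions(['pod-a', 'w', 'x', 'y', 'z'], 'pod-a'));
```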
mikecote added the Feature:Task Manager and Team:ResponseOps labels Jul 5, 2024
@elasticmachine
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

@mikecote
Contributor Author

@doakalexi I added a note in the requirements that assigning partitions to Kibana nodes is exclusive to the mget task claiming strategy. There is no need to modify the default task claimer, and your code already aligns with this :).

doakalexi added a commit that referenced this issue Jul 19, 2024
Resolves #187700

## Summary

This PR uses the discovery service to assign a subset of the partitions to
each Kibana node so that only two Kibana nodes fight for the same tasks.

### Checklist

- [ ] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios


### To verify

This change is only for mget, so add the following to `kibana.yml`
```
xpack.task_manager.claim_strategy: 'unsafe_mget'
```
**Testing locally**

Old tasks
- Checkout main and create a new rule, let it run
- Stop kibana
- Checkout this branch and restart kibana
- Verify that on the first run after restarting (when the task does not
yet have a partition) the rule runs. It might be helpful to create a rule
with a long interval and use “Run soon”.

<details>
<summary>New tasks, but it might be easier to just test on
cloud</summary>

- Start Kibana
- Replace this
[line](https://github.com/elastic/kibana/pull/188368/files#diff-46ca6f79fdc2b69e1d6ddc2401eab6469f8dfb9521f93f90132de624a9693aa5R48)
with the following
```
return [this.podName, 'w', 'x', 'y', 'z'];
```
- Create a few rules and check their partition values using the example
query below:
```
POST .kibana_task_manager*/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "task.taskType": {
              "value": "alerting:.es-query"
            }
          }
        }
      ]
    }
  }
}
```
- Using the partition map that is expected to be generated for the
current Kibana node, verify that tasks with partitions in the map
run and tasks with partitions that are not in the map do not run (a rough
sketch of the corresponding claim-time filter follows this collapsed section).

```
[
  0, 2, 5, 7, 10, 12, 15, 17, 20, 22, 25, 27, 30, 32, 35, 37, 40, 42, 45, 47, 50, 52, 55, 57, 60,
  62, 65, 67, 70, 72, 75, 77, 80, 82, 85, 87, 90, 92, 95, 97, 100, 102, 105, 107, 110, 112, 115,
  117, 120, 122, 125, 127, 130, 132, 135, 137, 140, 142, 145, 147, 150, 152, 155, 157, 160, 162,
  165, 167, 170, 172, 175, 177, 180, 182, 185, 187, 190, 192, 195, 197, 200, 202, 205, 207, 210,
  212, 215, 217, 220, 222, 225, 227, 230, 232, 235, 237, 240, 242, 245, 247, 250, 252, 255
]
```
</details>
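
Per the requirements, task claiming should pick up tasks in this node's partitions plus tasks that have no partition value yet (e.g. the old tasks from the first scenario above). A rough sketch of what that filter could look like, with the field name and query shape as assumptions rather than the exact query the PR builds:

```
// Illustrative only; 'task.partition' and the query structure are assumptions.
function buildPartitionFilter(myPartitions: number[]) {
  return {
    bool: {
      should: [
        // tasks in one of this node's assigned partitions
        { terms: { 'task.partition': myPartitions } },
        // tasks created before partitioning existed (no partition value yet)
        { bool: { must_not: [{ exists: { field: 'task.partition' } }] } },
      ],
      minimum_should_match: 1,
    },
  };
}
```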

**Testing on cloud**

- The PR has been deployed to cloud, and you can create multiple rules
and verify that they all run. If for some reason they do not run, that means
the nodes are not picking up their assigned partitions correctly.