
Watcher executes on a node after the .watches shard has been moved to a different node #105933

Closed
masseyke opened this issue Mar 4, 2024 · 5 comments · Fixed by #110061

masseyke commented Mar 4, 2024

Elasticsearch Version

8.8.2 (but likely others as well)

Installed Plugins

No response

Java Version

bundled

OS Version

n/a

Problem Description

This happens only rarely, but I have seen evidence of it twice in the same cluster a few days apart. As part of normal shard reallocation, a .watches shard gets moved off of a node (I'll call it nodeA) and onto another node (nodeB). This ought to mean that nodeA stops running watches. And we see in nodeA's logs:

paused watch execution, reason [no local watcher shards found], cancelled [0] queued tasks

But if we search .watcher-history-*, we see that now both nodeA and nodeB are executing the same watch, at nearly the same time on the same schedule. So instead of getting executed once every 10 minutes (for example), the watch gets executed twice every 10 minutes.
Aside from the message above, I haven't seen anything relevant in the logs. Restarting the nodes solves the problem.
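
For example, a query along these lines (the watch id is a placeholder, and this assumes the history documents' node field can be aggregated as a keyword) shows which nodes have been executing the watch:

curl "localhost:9200/.watcher-history-*/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "query": { "term": { "watch_id": "my-watch" } },
  "aggs": {
    "executing_nodes": { "terms": { "field": "node" } }
  }
}
'

When the problem is active, the executing_nodes buckets list both nodeA and nodeB instead of only nodeB.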

Steps to Reproduce

Unknown

Logs (if relevant)

No response

@elasticsearchmachine

Pinging @elastic/es-data-management (Team:Data Management)

masseyke commented Mar 4, 2024

Also, my initial thought was just that since watcher history records are written asynchronously, there could be a little lag. But the duplicate execution goes on for days, until the cluster is restarted, and nodeA's history records carry the same timestamps as the watcher history records from nodeB.

@masseyke

I still don't understand this, but I've been able to manually reproduce it by just artificially slowing down the pause logic. I'm writing the steps here since it's late on a Friday and I don't want to forget:

  • I started a 3-node, single-master cluster
  • I created a watch that polls every minute (I just used the watch from https://www.elastic.co/guide/en/elasticsearch/reference/current/watcher-api-put-watch.html#watcher-api-put-watch-example)
  • I put debugger breakpoints in various places in the pause logic -- in TickerScheduleTriggerEngine::pauseExecution and WatcherService::pauseExecution
  • I had .watches shards on node-1 and node-2
  • I hooked the debugger into node-2
  • I used the _cluster/reroute API to move the .watches shards from node-2 to node-3
  • I caught the breakpoints, looked around for a few seconds, and then let them go
  • Now both node-2 and node-3 are running the single watch!

Relevant curl commands:
Create the watch:

curl -X PUT "localhost:9200/_watcher/watch/my-watch?pretty" -H 'Content-Type: application/json' -d'
{
  "trigger" : {
    "schedule" : { "cron" : "0 0/1 * * * ?" }
  },
  "input" : {
    "search" : {
      "request" : {
        "indices" : [
          "logstash*"
        ],
        "body" : {
          "query" : {
            "bool" : {
              "must" : {
                "match": {
                   "response": 404
                }
              },
              "filter" : {
                "range": {
                  "@timestamp": {
                    "from": "{{ctx.trigger.scheduled_time}}||-5m",
                    "to": "{{ctx.trigger.triggered_time}}"
                  }
                }
              }
            }
          }
        }
      }
    }
  },
  "condition" : {
    "compare" : { "ctx.payload.hits.total" : { "gt" : 0 }}
  },
  "actions" : {
    "email_admin" : {
      "email" : {
        "to" : "[email protected]",
        "subject" : "404 recently encountered"
      }
    }
  }
}
'

Move the shards:

curl -X POST "localhost:9200/_cluster/reroute?metric=none&pretty" -H 'Content-Type: application/json' -d'
{
  "commands": [
    {                      
      "move": {
        "index": ".watches", "shard": 0,
        "from_node": "node-2", "to_node": "node-3"
      }
    }
  ]
}
'
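
To see when the move has actually completed, something like:

curl "localhost:9200/_cat/shards/.watches?v"

shows which node the shard is on and whether it is STARTED or still RELOCATING.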

Check watcher history:

curl "localhost:9200/.watcher-history-*/_search?size=100" -H 'Content-Type: application/json' -d'
{
  "sort": [
    { "@timestamp": "desc"}
  ]
}
' | python3 -mjson.tool | grep node
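
Another quick check is the watcher stats API, which reports watcher_state and a watch_count for each node:

curl "localhost:9200/_watcher/stats?pretty"

Assuming watch_count reflects the watches registered with that node's trigger service, node-2 ought to drop to 0 once it is paused; while the bug is active it keeps a non-zero count alongside node-3.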

Also kind of interesting: several times after the problem began, a cluster state change came through and WatcherLifecycleService::pauseExecution was called. If it went on to call watcherService.pauseExecution(reason), it would most likely stop the errant watch on node-2. But since there are no allocation ids left, clearAllocationIds() returns false, so watcherService.pauseExecution(reason) is never called.
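
A minimal toy sketch of that guard as I understand it (names and shapes are simplified assumptions, not the actual WatcherLifecycleService code):

import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

public class PauseGuardSketch {
    // stand-in for the locally tracked .watches shard routings / allocation ids
    private static final AtomicReference<List<String>> previousAllocationIds =
        new AtomicReference<>(List.of());

    static boolean clearAllocationIds() {
        List<String> previous = previousAllocationIds.getAndSet(List.of());
        return previous.isEmpty() == false; // false when there was nothing to clear
    }

    static void onClusterChangedWithNoLocalWatcherShards(String reason) {
        // The earlier pause already emptied the list, so this returns false and the
        // watcher service's pauseExecution(reason) is never called again -- the
        // errant schedule on node-2 survives later cluster state updates.
        if (clearAllocationIds()) {
            System.out.println("pausing watcher: " + reason);
        }
    }

    public static void main(String[] args) {
        onClusterChangedWithNoLocalWatcherShards("no local watcher shards found");
    }
}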

So it seems to be a race condition in the pause logic. Something seems to be calling TickerScheduleTriggerEngine::add after TickerScheduleTriggerEngine::pauseExecution runs.

masseyke commented May 3, 2024

Here's how the race condition works:
Say watcher is paused because the .watches shards have been allocated off of this node (this happens more frequently now because of the new shard allocator in 8.6.0). TickerScheduleTriggerEngine.pauseExecution() calls schedules.clear(). But while pause was running, a regularly-scheduled execution of a watch was already in flight. After pause completes, that execution finishes and, via WatcherIndexingListener, the watch is re-added to the trigger service. TickerScheduleTriggerEngine.add sees that the watch is not in schedules, so it helpfully puts it back in there for us. So now this watch is running on this node, and on whichever node the shard got allocated to!
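
A self-contained toy model of that interleaving (the map and the two threads below are simplified stand-ins for TickerScheduleTriggerEngine, the in-flight execution, and WatcherIndexingListener, not the real code):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PauseAddRaceSketch {
    // stand-in for TickerScheduleTriggerEngine's schedules map
    static final Map<String, String> schedules = new ConcurrentHashMap<>();

    public static void main(String[] args) throws InterruptedException {
        schedules.put("my-watch", "every minute");

        // Thread 1: the pause path, triggered by the .watches shard moving away.
        Thread pause = new Thread(schedules::clear);

        // Thread 2: an execution that was already in flight finishes; its status write
        // to .watches goes through WatcherIndexingListener, which re-registers the
        // watch with the trigger service.
        Thread reAdd = new Thread(() -> schedules.put("my-watch", "every minute"));

        pause.start();
        reAdd.start();
        pause.join();
        reAdd.join();

        // If reAdd happened to run after pause, the watch is back in the schedules of
        // a node that no longer holds a .watches shard -> duplicate executions.
        System.out.println("schedules after the interleaving: " + schedules);
    }
}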

@masseyke

I have tried to automate this race condition into a test, with no luck so far.

First I tried writing an AbstractWatcherIntegrationTestCase for it, before realizing that the class that I believe has the race condition (TickerScheduleTriggerEngine) is mocked out in that test with ScheduleTriggerEngineMock.

Then I tried writing an ESIntegTestCase briefly, before realizing that there was probably a reason so much was mocked in AbstractWatcherIntegrationTestCase -- I was unable to get the server to start up with Watcher running TickerScheduleTriggerEngine.

So I wrote an ESRestTestCase, based on SmokeTestWatcherTestSuiteIT. I brought up a 5-node cluster and created a watch that executes every 10ms (which required a bit of a hack, since we normally prevent intervals that short). I then reallocated one of the two .watches shards constantly -- as soon as I found that it had finished reallocating, I moved it again. I did this a few dozen times, all while the watch was running over and over. Then I looked at the watch history and waited until the most recent 10 entries had all run on the same node (if we hit the race condition I'd expect the watch to be running on 2 or more nodes). Unfortunately the test succeeds every single time. I also artificially slowed down the watch, hoping to make the race condition more likely. No luck.
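
Roughly, the shard-bouncing loop looks like this (simplified; the class name, node handling, and assertions are illustrative rather than verbatim from the test):

import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.test.rest.ESRestTestCase;

import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.containsString;

public class WatcherPauseRaceIT extends ESRestTestCase {

    private void bounceWatchesShard(String fromNode, String toNode) throws Exception {
        // move the .watches shard to the other node
        Request reroute = new Request("POST", "/_cluster/reroute");
        reroute.setJsonEntity(
            "{\"commands\":[{\"move\":{\"index\":\".watches\",\"shard\":0,"
                + "\"from_node\":\"" + fromNode + "\",\"to_node\":\"" + toNode + "\"}}]}");
        client().performRequest(reroute);

        // wait until the shard is STARTED on the target node before bouncing it back
        assertBusy(() -> {
            Request cat = new Request("GET", "/_cat/shards/.watches");
            String shards = EntityUtils.toString(client().performRequest(cat).getEntity());
            assertThat(shards, containsString(toNode));
            assertThat(shards, containsString("STARTED"));
        });
    }
}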
