Watcher executes on a node after the .watches shard has been moved to a different node #105933
Comments
Pinging @elastic/es-data-management (Team:Data Management)
Also, my initial thought was that since watcher history records are written asynchronously, there could be a little lag. But the duplicate executions go on for days, until the cluster is restarted, with the same timestamps as the watcher history records from nodeB.
I still don't understand this, but I've been able to manually reproduce it by just artificially slowing down the pause logic. I'm writing the steps here since it's late on a Friday and I don't want to forget:
Relevant curl commands:
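For example, a frequently firing watch can be created with something like the following (watch id, credentials, and interval are placeholders, not the exact commands used in the repro):

```sh
# Create a watch that fires every 10 seconds and just logs a message.
curl -u elastic:password -X PUT "localhost:9200/_watcher/watch/test_watch" \
  -H 'Content-Type: application/json' -d'
{
  "trigger": { "schedule": { "interval": "10s" } },
  "input":   { "simple": { "key": "value" } },
  "actions": {
    "log": { "logging": { "text": "test_watch executed" } }
  }
}'
```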
Move the shards:
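Something along these lines, using the cluster reroute API (node names are placeholders):

```sh
# See which node currently holds the .watches shard.
curl -u elastic:password "localhost:9200/_cat/shards/.watches?v"

# Move the shard to another node.
curl -u elastic:password -X POST "localhost:9200/_cluster/reroute" \
  -H 'Content-Type: application/json' -d'
{
  "commands": [
    { "move": { "index": ".watches", "shard": 0,
                "from_node": "nodeA", "to_node": "nodeB" } }
  ]
}'
```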
Check watcher history:
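For instance, by pulling the most recent watch records along with the node each one ran on (field names follow the standard watch-record format; adjust as needed):

```sh
# Most recent watch executions and the node each one ran on.
curl -u elastic:password "localhost:9200/.watcher-history-*/_search?pretty" \
  -H 'Content-Type: application/json' -d'
{
  "size": 10,
  "sort": [ { "trigger_event.triggered_time": "desc" } ],
  "_source": [ "watch_id", "node", "trigger_event.triggered_time", "state" ]
}'
```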
Also kind of interesting -- several times after the problem began, a cluster state change has come through and WatcherLifecycleService::pauseExecution has been called. So it seems to be a race condition in the pause logic: something seems to be calling TickerScheduleTriggerEngine::add after TickerScheduleTriggerEngine::pauseExecution runs.
Here's how the race condition works:
I have tried to automate this race condition into a test, with no luck so far. First I tried writing an AbstractWatcherIntegrationTestCase for it, before realizing that the class I believe has the race condition (TickerScheduleTriggerEngine) is mocked out in that test by ScheduleTriggerEngineMock. Then I briefly tried writing an ESIntegTestCase, before realizing that there was probably a reason so much was mocked in AbstractWatcherIntegrationTestCase -- I was unable to get the server to start up with Watcher running TickerScheduleTriggerEngine. So I wrote an ESRestTestCase based on SmokeTestWatcherTestSuiteIT. I brought up a 5-node cluster, created a watch that executes every 10ms (which required a bit of a hack, since we normally prevent intervals that short), and constantly reallocated one of the two .watches shards -- as soon as one relocation finished, I started the next. I did this a few dozen times, all while the watch was running over and over. Then I looked at the watch history and waited until the most recent 10 entries had all run on the same node (if we hit the race condition, I'd expect the watch to be running on 2 or more nodes). Unfortunately the test succeeds every single time. I also artificially slowed down the watch, hoping to make the race condition more likely. No luck.
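A rough shell equivalent of the reallocation loop described above (the actual test drove this from Java; node names, credentials, and iteration count are placeholders):

```sh
# Bounce the .watches shard back and forth, waiting for each relocation
# to finish before starting the next one.
for i in $(seq 1 50); do
  from="node-$((i % 2))"; to="node-$(( (i + 1) % 2 ))"
  curl -s -u elastic:password -X POST "localhost:9200/_cluster/reroute" \
    -H 'Content-Type: application/json' \
    -d "{\"commands\":[{\"move\":{\"index\":\".watches\",\"shard\":0,\"from_node\":\"$from\",\"to_node\":\"$to\"}}]}" > /dev/null
  # Wait until the shard is no longer relocating.
  while curl -s -u elastic:password "localhost:9200/_cat/shards/.watches" | grep -q RELOCATING; do
    sleep 1
  done
done
```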
Elasticsearch Version
8.8.2 (but likely others as well)
Installed Plugins
No response
Java Version
bundled
OS Version
n/a
Problem Description
This happens only rarely, but I have seen evidence of it twice in the same cluster a few days apart. As part of normal shard reallocation, a .watches shard gets moved off of a node (I'll call it nodeA) and onto another node (nodeB). This ought to mean that nodeA stops running watches, and we see in nodeA's logs:

But if we search .watcher-history-*, we see that both nodeA and nodeB are now executing the same watch, at nearly the same time on the same schedule. So instead of getting executed once every 10 minutes (for example), the watch gets executed twice every 10 minutes.

Aside from the message above, I haven't seen anything relevant in the logs. Restarting the nodes solves the problem.
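One way to see this in the history is to aggregate recent executions of a single watch by node (watch id and time range are placeholders; this assumes the node field in the watch records is aggregatable):

```sh
# Which nodes have executed watch "my_watch" in the last hour?
curl -u elastic:password "localhost:9200/.watcher-history-*/_search?pretty" \
  -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term":  { "watch_id": "my_watch" } },
        { "range": { "trigger_event.triggered_time": { "gte": "now-1h" } } }
      ]
    }
  },
  "aggs": {
    "executing_nodes": { "terms": { "field": "node" } }
  }
}'
```

If the problem is present, more than one node shows up in the aggregation even though only one node should be executing the watch.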
Steps to Reproduce
Unknown
Logs (if relevant)
No response