[BUG] Periodic Scheduler stuck on March 8th 2AM EST, while the servers' clock was set to UTC #7289
Comments
Note: This has been a multi-year issue that hits us every year; see #5410 and #3392. It looks like the upstream project (which was archived a long time ago) has had a fix available since 2016 that was never merged (gorhill/cronexpr#17).
According to @jippi's assumption, the scheduler goes nuts even if only one periodic job is not in the UTC timezone.
I just got paged into a "fun" outage where a single task running in a localized timezone caused hundreds of other batch tasks to not be dispatched. What can be done to make sure this bug doesn't go the way of the others referenced above?
Similar to us @the-maldridge =)
Hi @burdandrei, @jippi and @the-maldridge. Thanks a lot for the detail in this issue, and apologies that this has both caused impact and existed for a while. The team started some discussions yesterday on how best to resolve this, and we will talk about it again today. I'll likely close this issue as a duplicate of the already linked #5410; however, I think it's worth leaving this open for at least today so that anyone else encountering this problem can quickly and easily find the conversation.
Thanks for the update @jrasell
We also had this happen in our dev/prod clusters running 0.9.6 on Ubuntu 16.04. Unfortunately, fluentd dropped the logs that would have ended up in Kibana, and we shut down the node once it alerted for 0% disk space, which replaced it in the autoscaling group. So I don't have much debugging info to add, but I do know that it used up a ton of memory and disk space on the box. Hopefully this will at least help others next year.
We got hit by this issue as well. What did you do to get things back into a working state?
@jdebbink We had great luck with removing anything that wasn't running in the UTC timezone. After that, we did a stop/start on all jobs in batch/periodic mode and ran a monitoring query to figure out what needed an on-demand launch.
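For anyone applying the same workaround, here is a minimal sketch of how the non-UTC periodic jobs could be picked out of job definitions that have already been fetched as JSON. It assumes each job is a dict carrying an `ID` and a `Periodic` block with a `TimeZone` key, and that a missing or empty `TimeZone` means UTC. The function name `non_utc_periodic_jobs` and the sample jobs are made up for illustration; this is not an official tool.

```python
from typing import Iterable, List


def non_utc_periodic_jobs(jobs: Iterable[dict]) -> List[str]:
    """Return IDs of periodic jobs whose Periodic.TimeZone is not UTC.

    Assumption: a missing or empty TimeZone is treated as UTC.
    """
    flagged = []
    for job in jobs:
        periodic = job.get("Periodic")
        if not periodic:
            # Not a periodic job, nothing to check.
            continue
        time_zone = periodic.get("TimeZone") or "UTC"
        if time_zone.upper() != "UTC":
            flagged.append(job["ID"])
    return flagged


# Hypothetical sample data, shaped like already-parsed job JSON.
sample_jobs = [
    {"ID": "nightly-report", "Periodic": {"Spec": "30 2 * * *", "TimeZone": "America/New_York"}},
    {"ID": "hourly-sync", "Periodic": {"Spec": "0 * * * *", "TimeZone": "UTC"}},
    {"ID": "cleanup", "Periodic": {"Spec": "*/15 * * * *"}},
    {"ID": "web", "Periodic": None},
]

for job_id in non_utc_periodic_jobs(sample_jobs):
    print(job_id)
```

The flagged jobs are the candidates to remove or move to UTC before doing the stop/start described above.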
Also, just restarting the Nomad leader made everything work without any job changes :)
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |
Nomad version
Nomad v0.10.3 (65af1b9)
Operating system and Environment details
Issue
Periodic jobs stopped firing exactly on DST change
Reproduction steps
Save daylight time 🙄
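To illustrate why March 8th at 2 AM is special: under the US rule in force since 2007, clocks jump from 02:00 to 03:00 local time on the second Sunday in March, so wall-clock times in that hour do not exist in zones such as America/New_York. The sketch below computes that date with plain calendar arithmetic and checks whether a local time falls in the skipped hour. The helper names are hypothetical, and this only demonstrates the gap; it is not the scheduler's actual logic.

```python
from datetime import date, datetime, time, timedelta


def us_dst_start(year: int) -> date:
    """Second Sunday in March: the US spring-forward date (rule since 2007)."""
    march_first = date(year, 3, 1)
    # weekday(): Monday == 0 ... Sunday == 6
    days_to_first_sunday = (6 - march_first.weekday()) % 7
    return march_first + timedelta(days=days_to_first_sunday + 7)


def in_spring_forward_gap(wall_clock: datetime) -> bool:
    """True if a naive US local time falls in the skipped 02:00-02:59 hour."""
    return (
        wall_clock.date() == us_dst_start(wall_clock.year)
        and time(2, 0) <= wall_clock.time() < time(3, 0)
    )


print(us_dst_start(2020))
print(in_spring_forward_gap(datetime(2020, 3, 8, 2, 0)))
print(in_spring_forward_gap(datetime(2020, 3, 8, 3, 0)))
```

A cron-style "next run" calculation that expects 02:00 local time to exist on that date has no valid instant to return, which is consistent with periodic jobs in a localized timezone misbehaving exactly at the DST change.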
Here's the 24-hour log pattern from the Nomad server leader. Note the obvious spike at 2 AM UTC (8 AM local browser time) and the decrease after the Nomad leader was restarted and leadership migrated to another server.
Will post logs after sanitizing