[BUG] Periodic Scheduler stuck on March 8th 2AM EST, while servers clock was set to UTC #7289

burdandrei · 2020-03-08T09:56:20Z

Nomad version

Nomad v0.10.3 (65af1b9)

Operating system and Environment details

# cat /etc/lsb-release 
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.4 LTS"

# cat /etc/timezone 
Etc/UTC

Issue

Periodic jobs stopped firing exactly on DST change

Reproduction steps

Save daylight time 🙄

Here's the 24-hour log pattern from the Nomad server leader. Note the obvious spike at 2 AM UTC (8 AM local browser time) and the decrease after the Nomad leader was restarted and leadership migrated to another server.

Will post logs after sanitizing


burdandrei · 2020-03-08T10:24:57Z

Found the exact message nomad was shouting:
skipping launch of periodic job because job prohibits

Please let me know what other logs/info would be helpful.

jippi · 2020-03-08T10:33:17Z

Note: This has been a multi-year issue hitting us every year - see #5410 and #3392

Looks like the upstream project (which was archived a long time ago) has had a fix for it since 2016 that was never merged (gorhill/cronexpr#17)

burdandrei · 2020-03-08T12:45:06Z

According to @jippi's assumption, the scheduler goes nuts even if there's just one Periodic job that is not in the UTC time zone.
I checked this, and in the affected cluster a couple of jobs did indeed have the America/New_York time zone configured.
Other clusters, which only have crons in the UTC time zone, survived the night well.
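
For anyone reproducing this: the transition in question is the spring-forward gap in America/New_York, where wall-clock times from 02:00 up to (but not including) 03:00 on 2020-03-08 never occur. The sketch below is only an illustration in Python, not Nomad's or cronexpr's code. `wall_time_exists` and `one_second_later` are hypothetical helpers, and the snippet assumes Python 3.9+ with a time zone database available to `zoneinfo`.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

NEW_YORK = ZoneInfo("America/New_York")


def wall_time_exists(naive_local, tz):
    """Hypothetical helper: True if this wall-clock time occurs in tz.

    A time inside a spring-forward gap does not survive a round trip
    through UTC; it comes back shifted past the gap.
    """
    local = naive_local.replace(tzinfo=tz)
    round_trip = local.astimezone(timezone.utc).astimezone(tz)
    return round_trip.replace(tzinfo=None) == naive_local


def one_second_later(aware_local):
    """Advance by one real second: do the arithmetic in UTC, then convert back."""
    utc = aware_local.astimezone(timezone.utc) + timedelta(seconds=1)
    return utc.astimezone(aware_local.tzinfo)


if __name__ == "__main__":
    # Probe the hours around the transition on 2020-03-08.
    for hour in (1, 2, 3):
        t = datetime(2020, 3, 8, hour, 0)
        status = "exists" if wall_time_exists(t, NEW_YORK) else "skipped by DST"
        print(t, status)

    # The last second of EST is followed directly by 03:00:00 EDT.
    last_est = datetime(2020, 3, 8, 1, 59, 59, tzinfo=NEW_YORK)
    print(last_est, "->", one_second_later(last_est))
```

A cron expression evaluated in that zone that targets a time inside the gap has no matching instant on that day, so a next-run calculation has to handle that case explicitly.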

the-maldridge · 2020-03-09T07:24:15Z

I just got paged into a "fun" outage where a single task running in a localized timezone caused hundreds of other batch tasks to not be dispatched. What can be done to make sure this bug doesn't go the way of the others referenced above?

burdandrei · 2020-03-09T07:44:46Z

Same for us, @the-maldridge =)
We added a forced check of the Periodic jobs' time zone for now.
But obviously, in a multi-DC environment with a distributed team of developers, being able to set a time zone is very handy from the developer's perspective.
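
A check of that kind could look roughly like the following sketch. It is not the check described above, only an illustration. It assumes job definitions are available as dicts shaped like Nomad's JSON job format (an `ID` plus an optional `Periodic` block with a `TimeZone` field), and that an unset time zone means UTC. The function name and the sample jobs are made up.

```python
# Time zone names treated as equivalent to UTC for this check.
UTC_NAMES = {"UTC", "Etc/UTC"}


def non_utc_periodic_jobs(jobs):
    """Return (job ID, time zone) pairs for periodic jobs not scheduled in UTC."""
    flagged = []
    for job in jobs:
        periodic = job.get("Periodic")
        if not periodic:
            continue  # not a periodic job
        # Assumption: a missing or empty TimeZone is treated as UTC.
        tz = periodic.get("TimeZone") or "UTC"
        if tz not in UTC_NAMES:
            flagged.append((job["ID"], tz))
    return flagged


if __name__ == "__main__":
    # Hypothetical sample jobs, for illustration only.
    sample_jobs = [
        {"ID": "nightly-report", "Periodic": {"Spec": "0 2 * * *", "TimeZone": "America/New_York"}},
        {"ID": "hourly-sync", "Periodic": {"Spec": "0 * * * *", "TimeZone": "UTC"}},
        {"ID": "web", "Periodic": None},
    ]
    for job_id, tz in non_utc_periodic_jobs(sample_jobs):
        print(f"{job_id}: periodic job uses time zone {tz}")
```

Running something like this in CI or before job submission would reject or warn on non-UTC periodic jobs until a fixed Nomad version is deployed.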

jrasell · 2020-03-09T11:42:13Z

Hi @burdandrei, @jippi and @the-maldridge. Thanks a lot for the detail in this issue, and apologies that this has both caused impact and been in existence for a while. The team started some discussions yesterday on how best to resolve this and we will talk about it again today. I'll likely close this issue as a duplicate of the already linked #5410; however, I think it's worth leaving this open for at least today so that anyone else encountering this problem can quickly and easily find the conversation.

burdandrei · 2020-03-09T13:12:42Z

Thanks for the update, @jrasell

Dirrk · 2020-03-09T17:43:58Z

We also had this happen in our dev/prod clusters running 0.9.6 on Ubuntu 16.04. Unfortunately, fluentd dropped the logs that would have ended up in Kibana, and we shut down the node once it alerted for 0% disk space, which replaced it in the autoscaling group. So I don't have much to add for debugging info, but I do know that it used up a ton of memory and disk space on the box. Hopefully at least this will help others next year.

jdebbink · 2020-03-09T20:24:11Z

We got hit by this issue as well. What did you do to get things back into a working state?

the-maldridge · 2020-03-09T20:25:33Z

@jdebbink We had great luck with removing anything that wasn't running in UTC timezone. After that we did a stop/start on all jobs in batch/periodic mode and ran a monitoring query to figure out what needed an on-demand launch.

jippi · 2020-03-09T21:14:58Z

Also, just restarting the Nomad leader made everything work without any job changes :)

github-actions · 2022-11-07T02:33:21Z

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

jrasell added the theme/batch, type/bug, and theme/scheduling labels Mar 9, 2020

notnoop mentioned this issue May 7, 2020

Fix Daylight saving transition handling #7894

Merged

notnoop closed this as completed in #7894 May 12, 2020

github-actions bot locked as resolved and limited conversation to collaborators Nov 7, 2022
