job-ingest: ensure duplicate jobids are not issued across instance restart #2820
Conversation
Restarted hung ASAN builder |
ASAN builder failed here - this failure was just called out by @chu11 in #2818 (restarting) |
Restarted stalled ASAN builder again. I can't imagine how this PR would be increasing the odds of a hang. If I missed fixing module load order in a test, that would tend to cause a hard ENOSYS failure from the new RPCs, not a hang. Hmm. |
I think it is like flipping a coin: each time you do it, the odds of tails are 1 out of 2, but it still feels strange if you get a string of them in a row... The ASan hangs are getting annoying enough that I'm going to start trying to reproduce locally and get to the bottom of it. |
Recently ASan has seemed to hang on me more often. There was a recent PR where I think I had to restart it 3 times. I think Travis is just slower now than it used to be. I've added these two recent fixes, and I got one into this PR for the problem @garlick saw above. |
Problem: there is no way to save/restore fluid generator state across a module reload or instance restart.

Add a timestamp argument to fluid_init() which specifies a starting timestamp instead of assuming zero. Add two functions for retrieving timestamps that can be used to bootstrap a generator:
1) fluid_get_timestamp() extracts the timestamp from a FLUID. It may be useful to save the last allocated FLUID, then use it to restart a generator from the FLUID's timestamp + 1.
2) fluid_save_timestamp() obtains an up-to-date timestamp from a generator. The timestamp can be used to bootstrap a (possibly late-joining) generator peer.

Also [cleanup]: to handle the case of a user requesting more than 1024 FLUIDs from one generator ID within a 1 ms time quantum, the sleep + recursion in fluid_generate() is replaced with a busy-wait. Recursion in an unlikely code path seems like it is asking for trouble.

Update unit test and job-ingest module.

Fixes flux-framework#1545
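For illustration, a minimal sketch of how the reworked API might be used. The function names come from the commit message above, but the exact prototypes and the header location are assumptions, not copied from fluid.h:

```c
/* Sketch only: prototypes and header path assumed, not authoritative. */
#include <stdio.h>
#include <stdint.h>
#include "src/common/libutil/fluid.h"   /* assumed location in the flux-core tree */

int main (void)
{
    struct fluid_generator gen;
    fluid_t id;
    uint64_t ts;

    /* Resume a generator at timestamp 1000 instead of 0, e.g. after a
     * restart where the last issued FLUID carried timestamp 999.
     */
    if (fluid_init (&gen, 0, 1000) < 0)
        return 1;
    if (fluid_generate (&gen, &id) < 0)
        return 1;

    /* Timestamp recovered from the FLUID itself - save the last
     * allocated FLUID and restart a generator from ts + 1 later.
     */
    ts = fluid_get_timestamp (id);
    printf ("fluid=%ju timestamp=%ju\n", (uintmax_t)id, (uintmax_t)ts);

    /* Up-to-date timestamp straight from the generator, suitable for
     * handing to a (possibly late-joining) generator peer.
     */
    if (fluid_save_timestamp (&gen, &ts) < 0)
        return 1;
    printf ("generator timestamp=%ju\n", (uintmax_t)ts);
    return 0;
}
```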
Problem: we need to know the largest jobid ever allocated by this instance in order to restart the job-ingest FLUID generator after an instance restart.

Track the largest jobid in ctx->max_jobid, updating it on each job submission. Add an RPC handler "job-manager.getinfo" that returns an object containing the current max_jobid. Save a checkpoint containing max_jobid to the KVS, and try to initialize it from there when the job manager is (re-)loaded.
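A hedged sketch of what the KVS checkpoint write might look like using the public KVS API. The key name checkpoint.job-manager comes from the PR description below; the JSON layout of the checkpoint object and the helper name are assumptions:

```c
/* Sketch: persist max_jobid to the KVS so it survives an instance restart.
 * Key name from the PR description; payload layout assumed.
 */
#include <flux/core.h>

static int checkpoint_max_jobid (flux_t *h, int64_t max_jobid)
{
    flux_kvs_txn_t *txn;
    flux_future_t *f = NULL;
    int rc = -1;

    if (!(txn = flux_kvs_txn_create ()))
        return -1;
    if (flux_kvs_txn_pack (txn, 0, "checkpoint.job-manager",
                           "{s:I}", "max_jobid", max_jobid) < 0)
        goto done;
    /* Commit to the default namespace and block until it completes, so
     * the checkpoint is durable before the module finishes unloading.
     */
    if (!(f = flux_kvs_commit (h, NULL, 0, txn)))
        goto done;
    if (flux_future_get (f, NULL) < 0)
        goto done;
    rc = 0;
done:
    flux_future_destroy (f);
    flux_kvs_txn_destroy (txn);
    return rc;
}
```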
Reorder rc1 to fulfill new requirements for FLUID initialization:
1. job-manager - restores max_jobid from the KVS
2. job-ingest (rank 0) - asks job-manager for max_jobid
3. job-ingest (rank > 0) - asks its job-ingest TBON parent for a timestamp
Add a 'getinfo' RPC to job-manager-dummy. Alter loading order of job-ingest module ranks, and job-manager in tests and the 'job' rc personality.
Problem: The fluid generator in the job-ingest module is always initialized with timestamp = 0, which means job-ingest can generate duplicate jobids if the module is reloaded or the instance is restarted.

On rank 0, ask job-manager for max_jobid (which job-manager ensures persists across a restart). Extract the timestamp from this jobid and initialize the fluid generator with timestamp + 1. On rank > 0, ask the upstream job-ingest module for its current timestamp and initialize the fluid generator with it.

Fixes flux-framework#2816
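The rank 0 path, tying the two previous pieces together, might look roughly like the sketch below. flux_rpc() and flux_rpc_get_unpack() are standard flux-core calls; the FLUID prototypes, the header path, and the helper name are assumptions drawn from the commit messages:

```c
/* Sketch of the rank 0 bootstrap described above (not the module's actual
 * code): fetch max_jobid from job-manager, then start the local FLUID
 * generator one tick past its embedded timestamp.
 */
#include <flux/core.h>
#include "src/common/libutil/fluid.h"   /* assumed location */

static int bootstrap_fluid_rank0 (flux_t *h, struct fluid_generator *gen)
{
    flux_future_t *f;
    int64_t max_jobid;
    uint64_t timestamp = 0;

    if (!(f = flux_rpc (h, "job-manager.getinfo", NULL, FLUX_NODEID_ANY, 0)))
        return -1;
    if (flux_rpc_get_unpack (f, "{s:I}", "max_jobid", &max_jobid) < 0) {
        flux_future_destroy (f);
        return -1;
    }
    flux_future_destroy (f);

    /* If any jobs were submitted before the restart, resume one tick
     * past the timestamp embedded in the largest issued jobid.
     */
    if (max_jobid > 0)
        timestamp = fluid_get_timestamp ((fluid_t)max_jobid) + 1;

    /* Generator id 0 stands in for whatever per-rank id the module uses. */
    return fluid_init (gen, 0, timestamp);
}
```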
Force-pushed from 1de9739 to 85343e7
Pushed some tests and removed the WIP. Just thought of one more important test, so another force push coming, then this will be ready for review. |
Force-pushed from 85343e7 to 77ef3e8
OK, all done with tests. |
Quick naive question before I start reviewing: does this only cover successful shutdown? What happens if the rank 0 broker's node crashes? Hate to say this, but since all jobids now must be allocated serially, most of the benefit of FLUIDs is removed. While working on this PR, did you have any thoughts on whether it would be easier to just go back to monotonic jobids? |
The general problem of how to recover if something goes wrong with the rank 0 broker is something we should discuss. Right now if rank 0 crashes, the kvs checkpoint isn't written so all bets are off.
They are still allocated in parallel as before. In this PR the job manager is just tracking the max jobid it has accepted, and on restart, the ingest module is using that to initialize its FLUID timestamp. |
Got it, thanks for the clarification! That is helpful. |
Codecov Report
@@ Coverage Diff @@
## master #2820 +/- ##
==========================================
- Coverage 81.02% 81.01% -0.02%
==========================================
Files 250 250
Lines 39422 39511 +89
==========================================
+ Hits 31942 32008 +66
- Misses 7480 7503 +23
|
LGTM! I just did a quick test and it seems to work well!
Just one comment about perhaps a future issue, not really applicable to this PR.
@@ -105,6 +132,10 @@ int mod_main (flux_t *h, int argc, char **argv)
        flux_log_error (h, "flux_reactor_run");
        goto done;
    }
    if (checkpoint_to_kvs (&ctx) < 0) {
Perhaps we need a way for services that do checkpoints to determine whether the current instance is running with a backing store that will be persistent. If not, then the checkpoint is fruitless and could be skipped for faster shutdown.
For now it probably doesn't matter, but when we have a lot of services all doing checkpoints at shutdown, it could add up.
(apologies if this is done already and I just missed it!)
Good thought - I was also wondering if we should establish a protocol for where these go in the KVS, and what they contain. Dates might be useful, for example. An event to trigger writing of checkpoints might be useful. Should we write them out periodically in case the instance dies?
Larger topic than this PR I think.
I need to hit the road but plan to check back in this evening if anything comes up. This can go in if you're OK with the lackluster coverage. |
Yeah, mostly error conditions that are not covered. You could get a slight bump by testing that out-of-order module loading works as designed (i.e. an error is reported to the logs), but I don't see the value at this point. |
As discussed in #2816 and #1545, the job-ingest module may issue duplicate jobids if the module is reloaded on a live system, or if the instance restarts.
This PR adds the ability to start a FLUID generator at an initial timestamp other than zero.
The job manager now tracks the highest valid jobid and allows it to be queried via an RPC. The job manager ensures this value remains valid across instance restart by saving it inside a JSON object to checkpoint.job-manager, which it reads back in on startup, if it exists.

The job-ingest module on rank 0 queries this max_jobid on startup and uses it to form the initial timestamp for its FLUID generator. Job-ingest modules on rank > 0 query the current timestamp from an upstream job-ingest module and use that to initialize their FLUID generators.

This adds constraints on the module loading order, which are now reflected in rc1 and tests:
1. job-manager (restores max_jobid from the KVS)
2. job-ingest on rank 0 (asks job-manager for max_jobid)
3. job-ingest on rank > 0 (asks its job-ingest TBON parent for a timestamp)
Anyway, after all that, systemd can restart flux without messing up the jobid order.
TODO: tests