
job-ingest: ensure duplicate jobids are not issued across instance restart #2820

Merged
merged 8 commits from garlick:jobid_checkpoint into flux-framework:master
Mar 13, 2020

Conversation

garlick
Member

@garlick garlick commented Mar 12, 2020

As discussed in #2816 and #1545, the job-ingest module may issue duplicate jobids if the module is reloaded on a live system, or if the instance restarts.

This PR adds the ability to start a FLUID generator at an initial timestamp other than zero.

The job manager now tracks the highest valid jobid and allows it to be queried via an RPC. The job manager ensures this value remains valid across an instance restart by saving it in a JSON object at checkpoint.job-manager, which it reads back on startup if it exists.

The job-ingest module on rank 0 queries this max_jobid on startup and uses it to form the initial timestamp for its FLUID generator. Job-ingest modules on rank > 0 query the current timestamp from their upstream job-ingest module and use that to initialize their FLUID generators.
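For illustration, here is a minimal sketch of the rank 0 bootstrap path described above. The fluid function prototypes, the "max_jobid" payload key, and the generator id argument are assumptions for the sketch, not verbatim from the code:

#include <flux/core.h>
#include "src/common/libutil/fluid.h"

/* Sketch: rank 0 job-ingest asks job-manager for the highest jobid it
 * has accepted, then starts its FLUID generator just past that jobid's
 * timestamp so newly minted jobids sort after all existing ones.
 */
static int bootstrap_rank0 (flux_t *h, struct fluid_generator *gen)
{
    flux_future_t *f;
    int64_t max_jobid = 0;
    uint64_t timestamp = 0;

    if (!(f = flux_rpc (h, "job-manager.getinfo", NULL, FLUX_NODEID_ANY, 0)))
        return -1;
    if (flux_rpc_get_unpack (f, "{s:I}", "max_jobid", &max_jobid) < 0)
        goto error;
    if (max_jobid > 0)
        timestamp = fluid_get_timestamp ((fluid_t)max_jobid) + 1;
    if (fluid_init (gen, 0, timestamp) < 0)   /* generator id 0 assumed for rank 0 */
        goto error;
    flux_future_destroy (f);
    return 0;
error:
    flux_future_destroy (f);
    return -1;
}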

This adds constraints on the module loading order, which are now reflected in rc1 and the tests:

  1. job manager
  2. job ingest rank 0
  3. job ingest rank > 0

Anyway, after all that, systemd can restart flux without messing up the jobid order.

TODO: tests

@garlick
Member Author

garlick commented Mar 12, 2020

Restarted hung ASAN builder

@garlick
Member Author

garlick commented Mar 12, 2020

ASAN builder failed here - this failure was just called out by @chu11 in #2818 (restarting)

not ok 33 - stdin redirect from /dev/null works with -n
FAIL: t0005-exec.t 33 - stdin redirect from /dev/null works with -n
#	
#	       test_expect_code 0 run_timeout 1 flux exec -n -r0-3 cat
#	
ok 34 - stdin broadcast -- multiple lines
PASS: t0005-exec.t 34 - stdin broadcast -- multiple lines
ok 35 - stdin broadcast -- long lines
PASS: t0005-exec.t 35 - stdin broadcast -- long lines
# failed 1 among 35 test(s)
1..35
ERROR: t0005-exec.t - exited with status 1

@garlick
Member Author

garlick commented Mar 12, 2020

Restarted the stalled ASAN builder again. I can't imagine how this PR would be increasing the odds of a hang. If I missed fixing module load order in a test, that would tend to cause a hard ENOSYS failure from the new RPCs, not a hang. Hmm.

@grondo
Contributor

grondo commented Mar 12, 2020

Restarted the stalled ASAN builder again. I can't imagine how this PR would be increasing the odds of a hang. If I missed fixing module load order in a test, that would tend to cause a hard ENOSYS failure from the new RPCs, not a hang. Hmm.

I think it is like flipping a coin: each time you do it the odds of tails are 1 in 2, but it still feels strange if you get a string of them in a row...

The ASan hangs are starting to get annoying enough that I'm going to start trying to reproduce them locally and get to the bottom of it.

@chu11
Member

chu11 commented Mar 12, 2020

I think it is like flipping a coin: each time you do it the odds of tails are 1 in 2, but it still feels strange if you get a string of them in a row...

Recently ASan has seemed to hang on me more often. There was a recent PR where I think I had to restart it 3 times.

I think Travis is just slower now than it used to be. I've added these two recent fixes:

522ba5e
fdf7df4

and I got one in this PR for the problem @garlick saw above:

#2818

garlick added 6 commits March 13, 2020 08:42
Problem: there is no way to save/restore fluid generator
state across a module reload or instance restart.

Add a timestamp argument to fluid_init() which specifies
a starting timestamp instead of assuming zero.

Add two functions for retrieving timestamps that can be used
to bootstrap a generator:
1) fluid_get_timestamp() extracts timestamp from a FLUID.
It may be useful to save the last allocated FLUID, then use
it to restart a generator from the FLUID's timestamp + 1
2) fluid_save_timestamp() obtains an up-to-date timestamp
from a generator.  The timestamp can be used to bootstrap a
(possibly late-joining) generator peer.

Also [cleanup]: to handle the case of a user requesting more
than 1024 FLUIDs from one generator ID within a 1ms time quantum,
the sleep + recursion in fluid_generate() is replaced with a
busy-wait.  Recursion in an unlikely code path seems like it is
asking for trouble.

Update unit test and job-ingest module.

Fixes flux-framework#1545
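
As a rough illustration of the two bootstrap paths this commit message describes, here is a hedged sketch (function prototypes are assumed, not copied from fluid.h; error handling is trimmed):

#include <stdint.h>
#include "src/common/libutil/fluid.h"

/* Restart: resume a generator just past the timestamp of the last
 * FLUID it handed out before shutdown.
 */
static int generator_restart (struct fluid_generator *gen, uint32_t id,
                              fluid_t last_fluid)
{
    return fluid_init (gen, id, fluid_get_timestamp (last_fluid) + 1);
}

/* Late join: an existing generator exports an up-to-date timestamp
 * that a new peer (with a distinct generator id) starts from.
 */
static int generator_join (struct fluid_generator *existing,
                           struct fluid_generator *peer, uint32_t peer_id)
{
    uint64_t timestamp;

    if (fluid_save_timestamp (existing, &timestamp) < 0)
        return -1;
    return fluid_init (peer, peer_id, timestamp);
}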
Problem: we need to know the largest jobid ever allocated
by this instance in order to restart the job-ingest FLUID
generator after an instance restart.

Track the largest jobid in ctx->max_jobid, updating it on
each job submission.

Add an RPC handler "job-manager.getinfo" that returns an
object containing the current max_jobid.

Save a checkpoint containing max_jobid to the KVS, and try
to initialize it from there when the job manager is (re-)loaded.
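
A hedged sketch of what that checkpoint write might look like using the flux_kvs_txn interface; the key and payload follow the commit message, but the module code may differ in detail:

#include <flux/core.h>

/* Write max_jobid to checkpoint.job-manager as a small JSON object
 * so it can be read back on the next (re-)load (sketch only).
 */
static int checkpoint_to_kvs (flux_t *h, int64_t max_jobid)
{
    flux_kvs_txn_t *txn;
    flux_future_t *f = NULL;
    int rc = -1;

    if (!(txn = flux_kvs_txn_create ()))
        return -1;
    if (flux_kvs_txn_pack (txn, 0, "checkpoint.job-manager",
                           "{s:I}", "max_jobid", max_jobid) < 0)
        goto done;
    if (!(f = flux_kvs_commit (h, NULL, 0, txn))
        || flux_future_get (f, NULL) < 0)
        goto done;
    rc = 0;
done:
    flux_future_destroy (f);
    flux_kvs_txn_destroy (txn);
    return rc;
}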
Reorder rc1 to fulfill new requirements for FLUID
initialization:
1. job-manager - restores max_jobid from kvs
2. job-ingest (rank 0) - asks job-manager for max_jobid
3. job-ingest (rank > 0) - asks job-ingest TBON parent for timestamp
Add a 'getinfo' RPC to job-manager-dummy.
Alter the loading order of the job-ingest module ranks and
job-manager in tests and the 'job' rc personality.
Problem: The fluid generator in the job-ingest module is always
initialized with timestamp = 0, which means job-ingest can
generate duplicate jobids if the module is reloaded or the
instance is restarted.

On rank 0, ask job-manager for max_jobid (which job-manager ensures
persists across a restart).  Extract the timestamp from this jobid
and initialize the fluid generator with timestamp + 1.

On rank > 0, ask upstream job-ingest for current timestamp
and initialize the fluid generator with it.

Fixes flux-framework#2816
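
A sketch of the rank > 0 path described above: query the TBON-upstream job-ingest for its current FLUID timestamp and start from it. The topic string, payload key, and use of the broker rank as the generator id are illustrative assumptions, not necessarily the names the module actually uses:

#include <flux/core.h>
#include "src/common/libutil/fluid.h"

/* Sketch: rank > 0 job-ingest fetches the current timestamp from its
 * upstream job-ingest and initializes the local generator with it.
 */
static int bootstrap_downstream (flux_t *h, struct fluid_generator *gen,
                                 uint32_t rank)
{
    flux_future_t *f;
    int64_t timestamp;

    if (!(f = flux_rpc (h, "job-ingest.getinfo", NULL,
                        FLUX_NODEID_UPSTREAM, 0)))
        return -1;
    if (flux_rpc_get_unpack (f, "{s:I}", "timestamp", &timestamp) < 0
        || fluid_init (gen, rank, (uint64_t)timestamp) < 0) {
        flux_future_destroy (f);
        return -1;
    }
    flux_future_destroy (f);
    return 0;
}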
@garlick garlick changed the title WIP: job-ingest: ensure duplicate jobids are not issued across instance restart job-ingest: ensure duplicate jobids are not issued across instance restart Mar 13, 2020
@garlick
Member Author

garlick commented Mar 13, 2020

Pushed some tests and removed the WIP. Just thought of one more important test, so another force push coming, then this will be ready for review.

@garlick
Member Author

garlick commented Mar 13, 2020

OK, all done with tests.

@grondo
Contributor

grondo commented Mar 13, 2020

Quick naive question before I start reviewing:

This only covers successful shutdown? What happens if the rank 0 broker or its node crashes?

Hate to say this, but since all jobids now must be allocated serially, most of the benefit of FLUIDs is removed. While working on this PR, did you have any thoughts on whether it would be easier to just go back to monotonic jobids?

@garlick
Member Author

garlick commented Mar 13, 2020

This only covers successful shutdown? What happens if the rank 0 broker or its node crashes?

The general problem of how to recover if something goes wrong with the rank 0 broker is something we should discuss. Right now, if rank 0 crashes, the KVS checkpoint isn't written, so all bets are off.

Hate to say this, but since all jobids now must be allocated serially, most of the benefit of FLUIDs is removed. While working on this PR, did you have any thoughts on whether it would be easier to just go back to monotonic jobids?

They are still allocated in parallel as before. In this PR the job manager is just tracking the max jobid it has accepted, and on restart, the ingest module is using that to initialize its FLUID timestamp.

@grondo
Contributor

grondo commented Mar 13, 2020

Got it, thanks for the clarification! That is helpful.

@codecov-io

Codecov Report

Merging #2820 into master will decrease coverage by 0.01%.
The diff coverage is 68.51%.

@@            Coverage Diff             @@
##           master    #2820      +/-   ##
==========================================
- Coverage   81.02%   81.01%   -0.02%     
==========================================
  Files         250      250              
  Lines       39422    39511      +89     
==========================================
+ Hits        31942    32008      +66     
- Misses       7480     7503      +23
Impacted Files Coverage Δ
src/modules/job-manager/submit.c 79.24% <100%> (+0.39%) ⬆️
src/common/libutil/fluid.c 100% <100%> (ø) ⬆️
src/modules/job-ingest/job-ingest.c 71.8% <40%> (-3.77%) ⬇️
src/modules/job-manager/job-manager.c 54.54% <45.45%> (-1.82%) ⬇️
src/modules/job-manager/restart.c 83.49% <87.5%> (+1.3%) ⬆️
src/common/libflux/message.c 82.11% <0%> (-0.14%) ⬇️
src/common/libsubprocess/local.c 79.93% <0%> (+0.34%) ⬆️
src/common/libflux/handle.c 85.71% <0%> (+2%) ⬆️

Contributor

@grondo grondo left a comment


LGTM! I just did a quick test and it seems to work well!
Just one comment about a possible future issue, not really applicable to this PR.

@@ -105,6 +132,10 @@ int mod_main (flux_t *h, int argc, char **argv)
flux_log_error (h, "flux_reactor_run");
goto done;
}
if (checkpoint_to_kvs (&ctx) < 0) {
Contributor

Perhaps we need a way for services that do checkpoints to determine if the current instance is running with a backing store that will be persistent. If not, then checkpoint is fruitless, and could be skipped for faster shutdown.

For now it probably doesn't matter, but when we have a lot of services all doing checkpoints at shutdown, it could add up.

(apologies if this is done already and I just missed it!)

Member Author

Good thought - I was also wondering if we should establish a protocol for where these go in the KVS and what they contain. Dates might be useful, for example. An event to trigger writing of checkpoints might also be useful. Should we write them out periodically in case the instance dies?

Larger topic than this PR I think.

@garlick
Member Author

garlick commented Mar 13, 2020

I need to hit the road but plan to check back in this evening if anything comes up. This can go in if you're OK with the lackluster coverage.

@grondo
Contributor

grondo commented Mar 13, 2020

Yeah, it's mostly error conditions that are not covered. You could get a slight bump by testing that out-of-order module loading is handled as designed (i.e., an error is reported to the logs), but I don't see the value at this point.

@mergify mergify bot merged commit 17e788a into flux-framework:master Mar 13, 2020
@garlick garlick deleted the jobid_checkpoint branch March 14, 2020 15:28
@garlick garlick mentioned this pull request Mar 16, 2020