job-ingest: ensure duplicate jobids are not issued across instance restart #2820
Conversation
Restarted hung ASAN builder |
ASAN builder failed here - this failure was just called out by @chu11 in #2818 (restarting) |
Restarted stalled ASAN builder again. I can't imagine how this PR would be increasing the odds of a hang. If I missed fixing module load order in a test, that would tend to cause a hard ENOSYS failure from the new RPCs, not a hang. Hmm. |
I think it is like flipping a coin: each time you do it, the odds of tails are 1 out of 2, but it still feels strange if you get a string of them in a row... The ASan hangs are getting annoying enough that I'm going to start trying to reproduce locally and get to the bottom of it. |
Recently ASan has seemed to hang on me more often. There was a recent PR where I think I had to restart it 3 times. I think Travis is just slower now than it used to be. I've added these two recent fixes, and I got one into this PR for the problem @garlick saw above. |
Problem: there is no way to save/restore fluid generator state across a module reload or instance restart.

Add a timestamp argument to fluid_init() which specifies a starting timestamp instead of assuming zero. Add two functions for retrieving timestamps that can be used to bootstrap a generator:
1) fluid_get_timestamp() extracts the timestamp from a FLUID. It may be useful to save the last allocated FLUID, then use it to restart a generator from the FLUID's timestamp + 1.
2) fluid_save_timestamp() obtains an up-to-date timestamp from a generator. The timestamp can be used to bootstrap a (possibly late-joining) generator peer.

Also [cleanup]: to handle the case of a user requesting more than 1024 FLUIDs from one generator ID within a 1 ms time quantum, the sleep + recursion in fluid_generate() is replaced with a busy-wait. Recursion in an unlikely code path seems like it is asking for trouble.

Update unit test and job-ingest module.

Fixes flux-framework#1545
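For illustration, a minimal sketch of how the reworked API might be used. The function names come from the commit message above, but the exact prototypes and the header location are assumptions, not copied from fluid.h:

```c
/* Sketch only: prototypes and header path assumed, not authoritative. */
#include <stdio.h>
#include <stdint.h>
#include "src/common/libutil/fluid.h"   /* assumed location in the flux-core tree */

int main (void)
{
    struct fluid_generator gen;
    fluid_t id;
    uint64_t ts;

    /* Resume a generator at timestamp 1000 instead of 0, e.g. after a
     * restart where the last issued FLUID carried timestamp 999.
     */
    if (fluid_init (&gen, 0, 1000) < 0)
        return 1;
    if (fluid_generate (&gen, &id) < 0)
        return 1;

    /* Timestamp recovered from the FLUID itself - save the last
     * allocated FLUID and restart a generator from ts + 1 later.
     */
    ts = fluid_get_timestamp (id);
    printf ("fluid=%ju timestamp=%ju\n", (uintmax_t)id, (uintmax_t)ts);

    /* Up-to-date timestamp straight from the generator, suitable for
     * handing to a (possibly late-joining) generator peer.
     */
    if (fluid_save_timestamp (&gen, &ts) < 0)
        return 1;
    printf ("generator timestamp=%ju\n", (uintmax_t)ts);
    return 0;
}
```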
Problem: we need to know the largest jobid ever allocated by this instance in order to restart the job-ingest FLUID generator after an instance restart.

Track the largest jobid in ctx->max_jobid, updating it on each job submission. Add an RPC handler "job-manager.getinfo" that returns an object containing the current max_jobid. Save a checkpoint containing max_jobid to the KVS, and try to initialize it from there when the job manager is (re-)loaded.
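A hedged sketch of what the KVS checkpoint write might look like using the public KVS API. The key name checkpoint.job-manager comes from the PR description below; the JSON layout of the checkpoint object and the helper name are assumptions:

```c
/* Sketch: persist max_jobid to the KVS so it survives an instance restart.
 * Key name from the PR description; payload layout assumed.
 */
#include <flux/core.h>

static int checkpoint_max_jobid (flux_t *h, int64_t max_jobid)
{
    flux_kvs_txn_t *txn;
    flux_future_t *f = NULL;
    int rc = -1;

    if (!(txn = flux_kvs_txn_create ()))
        return -1;
    if (flux_kvs_txn_pack (txn, 0, "checkpoint.job-manager",
                           "{s:I}", "max_jobid", max_jobid) < 0)
        goto done;
    /* Commit to the default namespace and block until it completes, so
     * the checkpoint is durable before the module finishes unloading.
     */
    if (!(f = flux_kvs_commit (h, NULL, 0, txn)))
        goto done;
    if (flux_future_get (f, NULL) < 0)
        goto done;
    rc = 0;
done:
    flux_future_destroy (f);
    flux_kvs_txn_destroy (txn);
    return rc;
}
```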
Reorder rc1 to fulfill new requirements for FLUID initialization:
1. job-manager - restores max_jobid from the KVS
2. job-ingest (rank 0) - asks job-manager for max_jobid
3. job-ingest (rank > 0) - asks its job-ingest TBON parent for a timestamp
Add a 'getinfo' RPC to job-manager-dummy. Alter loading order of job-ingest module ranks, and job-manager in tests and the 'job' rc personality.
Problem: The fluid generator in the job-ingest module is always initialized with timestamp = 0, which means job-ingest can generate duplicate jobids if the module is reloaded or the instance is restarted.

On rank 0, ask job-manager for max_jobid (which job-manager ensures persists across a restart). Extract the timestamp from this jobid and initialize the fluid generator with timestamp + 1. On rank > 0, ask the upstream job-ingest module for its current timestamp and initialize the fluid generator with it.

Fixes flux-framework#2816
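The rank 0 path, tying the two previous pieces together, might look roughly like the sketch below. flux_rpc() and flux_rpc_get_unpack() are standard flux-core calls; the FLUID prototypes, the header path, and the helper name are assumptions drawn from the commit messages:

```c
/* Sketch of the rank 0 bootstrap described above (not the module's actual
 * code): fetch max_jobid from job-manager, then start the local FLUID
 * generator one tick past its embedded timestamp.
 */
#include <flux/core.h>
#include "src/common/libutil/fluid.h"   /* assumed location */

static int bootstrap_fluid_rank0 (flux_t *h, struct fluid_generator *gen)
{
    flux_future_t *f;
    int64_t max_jobid;
    uint64_t timestamp = 0;

    if (!(f = flux_rpc (h, "job-manager.getinfo", NULL, FLUX_NODEID_ANY, 0)))
        return -1;
    if (flux_rpc_get_unpack (f, "{s:I}", "max_jobid", &max_jobid) < 0) {
        flux_future_destroy (f);
        return -1;
    }
    flux_future_destroy (f);

    /* If any jobs were submitted before the restart, resume one tick
     * past the timestamp embedded in the largest issued jobid.
     */
    if (max_jobid > 0)
        timestamp = fluid_get_timestamp ((fluid_t)max_jobid) + 1;

    /* Generator id 0 stands in for whatever per-rank id the module uses. */
    return fluid_init (gen, 0, timestamp);
}
```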
Force-pushed from 1de9739 to 85343e7
Pushed some tests and removed the WIP. Just thought of one more important test, so another force push coming, then this will be ready for review. |
Force-pushed from 85343e7 to 77ef3e8
OK, all done with tests. |
Quick naive question before I start reviewing: does this only cover successful shutdown? What happens if the rank 0 broker's node crashes? Hate to say this, but since all jobids now must be allocated serially, most of the benefit of FLUIDs is removed. While working on this PR, did you have any thoughts on whether it would be easier to just go back to monotonic jobids? |
The general problem of how to recover if something goes wrong with the rank 0 broker is something we should discuss. Right now if rank 0 crashes, the kvs checkpoint isn't written so all bets are off.
They are still allocated in parallel as before. In this PR the job manager is just tracking the max jobid it has accepted, and on restart, the ingest module is using that to initialize its FLUID timestamp. |
Got it, thanks for the clarification! That is helpful. |
Codecov Report
@@ Coverage Diff @@
## master #2820 +/- ##
==========================================
- Coverage 81.02% 81.01% -0.02%
==========================================
Files 250 250
Lines 39422 39511 +89
==========================================
+ Hits 31942 32008 +66
- Misses 7480 7503 +23
|
LGTM! I just did a quick test and it seems to work well!
Just one comment about perhaps a future issue, not really applicable to this PR.
@@ -105,6 +132,10 @@ int mod_main (flux_t *h, int argc, char **argv)
        flux_log_error (h, "flux_reactor_run");
        goto done;
    }
    if (checkpoint_to_kvs (&ctx) < 0) {
Perhaps we need a way for services that do checkpoints to determine whether the current instance is running with a backing store that will be persistent. If not, then the checkpoint is fruitless and could be skipped for faster shutdown.
For now it probably doesn't matter, but when we have a lot of services all doing checkpoints at shutdown, it could add up.
(apologies if this is done already and I just missed it!)
Good thought - I was also wondering if we should establish a protocol for where these go in the KVS, and what they contain. Dates might be useful, for example. An event to trigger writing of checkpoints might be useful. Should we write them out periodically in case the instance dies?
Larger topic than this PR I think.
I need to hit the road but plan to check back in this evening if anything comes up. This can go in if you're OK with the lackluster coverage. |
Yeah, mostly error conditions that are not covered. You could get a slight bump by testing that out-of-order module loading works as designed (i.e. an error is reported to the logs), but I don't see the value at this point. |
As discussed in #2816 and #1545, the job-ingest module may issue duplicate jobids if the module is reloaded on a live system, or if the instance restarts.
This PR adds the ability to start a FLUID generator at an initial timestamp other than zero.
The job manager now tracks the highest valid jobid and allows it to be queried via an RPC. The job manager ensures this value remains valid across instance restart by saving it inside a JSON object to checkpoint.job-manager, which it reads back in on startup, if it exists.

The job-ingest module on rank 0 queries this max_jobid on startup and uses it to form the initial timestamp for its FLUID generator. Job-ingest modules on rank > 0 query the current timestamp from an upstream job-ingest module and use that to initialize their FLUID generators.

This adds constraints on the module loading order, which are now reflected in rc1 and tests:
1. job-manager (restores max_jobid from the KVS)
2. job-ingest on rank 0 (asks job-manager for max_jobid)
3. job-ingest on rank > 0 (asks its job-ingest TBON parent for a timestamp)
Anyway, after all that, systemd can restart flux without messing up the jobid order.
TODO: tests