
kvs: add defensive checkpoint via sync configuration option #4383

Merged · 7 commits · Jul 26, 2022

Conversation

@chu11 (Member) commented Jun 29, 2022

Per discussion in #4301, add a sync = <time> option to the KVS, so that the primary kvs namespace can be checkpointed on a regular basis.

@garlick (Member) commented Jun 29, 2022

On a quick skim, it looks like the timer wakes up even when there has been no kvs activity, and also that it commits actual data to the KVS each time it syncs. I don't think either of those things are desirable or necessary.

Maybe a timer is the wrong approach - could the commit logic check how long it's been since the last sync and just do one if periodic sync is configured AND it's been long enough?

@chu11 (Member, Author) commented Jun 29, 2022

On a quick skim, it looks like the timer wakes up even when there has been no kvs activity, and also that it commits actual data to the KVS each time it syncs. I don't think either of those things are desirable or necessary.

Yup, it does wake up if there hasn't been any activity, but it maintains the last sequence number of the root so that it won't do another commit unnecessarily.

    /* if no changes to root since last sync, don't do it */
    if (ctx->sync_last_rootseq == root->seq)
        return;

Maybe a timer is the wrong approach - could the commit logic check how long it's been since the last sync and just do one if periodic sync is configured AND it's been long enough?

Hmmm, that's an interesting idea. Perhaps if there hasn't been a checkpoint in a while, just add FLUX_KVS_SYNC to that particular commit? It's worth trying out to see if it's cleaner.

@garlick (Member) commented Jun 29, 2022

It's probably a non starter to have the code periodically commit something, especially something different each time since that contributes to content.sqlite growth.

@chu11 (Member, Author) commented Jun 29, 2022

especially something different each time since that contributes to content.sqlite growth.

Ahhh that's a good point. Can definitely write the same thing over and over again instead.

It's probably a non starter to have the code periodically commit something

As I was trying to experiment with this idea, it occurred to me that adding a FLUX_KVS_SYNC to a kvs transaction probably isn't a good idea. We wouldn't want an error on content.flush or checkpoint to lead to a commit error, especially for a KVS transaction the kvs module doesn't "own". We could add "back out" logic if the sync fails, but things start to get complicated.

That leads to the idea of using the timer technique but having it just do a content.flush and checkpoint, independent of the commit logic. However, I realized in a prior prototype that this would be racy with anyone using the FLUX_KVS_SYNC flag. By doing a commit with the FLUX_KVS_SYNC flag instead, commits aren't racy against other attempts to sync.

I'll continue to ponder, maybe there's another idea I'm not thinking about yet.

@garlick (Member) commented Jun 29, 2022

If a sync fails, something is really going wrong with the system or the backing store, and KVS data is in jeopardy with no real recovery options, so it probably would be fine IMHO to fail a commit in that case.

@chu11 (Member, Author) commented Jun 29, 2022

Hmmm, here's an idea. What if we implemented a hypothetical FLUX_KVS_SYNC_NO_COMMIT flag (this flag could potentially be internal to KVS only). It's a KVS transaction that has no operations, it just performs the flush & checkpoint. I don't think it'd be too hard b/c it'd just be a tweak to the kvstxn state machine to skip all of those states in the middle (although this could be famous last words).

If we keep the timer approach, it's not that different from what is in this PR once the above is implemented.

If we go with the "check how recent the last commit was and if it's > sync time ago" approach, we don't have to alter the current transaction. Just add a new KVS transaction.

FWIW, I'm currently less than enthused with the latter approach. It would add a ctx->rank == 0 && strcmp (namespace, KVS_PRIMARY) == 0 && ctx->sync > 0.0 check, plus a clock_getmonotime() call if true, into a pretty critical section. It might be a net loss in performance compared to the timer.

@garlick (Member) commented Jun 29, 2022

Oh, duh - the way I proposed (not using a timer) could leave dirty data unsynced longer than the configured time if another commit doesn't come along. Sorry about that, your intuition was right that a timer is the way to go!

The spurious wakeup can be avoided by not starting the timer watcher until a commit comes in. E.g. each commit checks if the timer is active, and if not, sets the time and starts it. Timer expires, sync happens, and the system could be completely idle until another commit comes in and starts the timer again.

You're the expert on the best way to do this inside the KVS, so take this with a grain of salt but instead of a flag could we just have the timer post a regular commit with no ops (empty txn) and the FLUX_KVS_SYNC flag, and then just ensure that the state machine does the right thing with that if it doesn't already? It seems like it would be nice to allow that anyway, from outside the KVS module.
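For reference, a minimal sketch of what such an empty sync-only commit could look like, assuming the standard libkvs calls (flux_kvs_txn_create(), flux_kvs_commit(), flux_future_then()); submit_sync_commit() and sync_continuation() are placeholder names, and the premise that an empty txn plus FLUX_KVS_SYNC flushes and checkpoints without adding content is exactly what is being proposed here, not established behavior:

    static int submit_sync_commit (struct kvs_ctx *ctx)
    {
        flux_kvs_txn_t *txn;
        flux_future_t *f;

        if (!(txn = flux_kvs_txn_create ()))        /* empty txn, no ops */
            return -1;
        if (!(f = flux_kvs_commit (ctx->h, KVS_PRIMARY_NAMESPACE,
                                   FLUX_KVS_SYNC, txn))
            || flux_future_then (f, -1., sync_continuation, ctx) < 0) {
            flux_future_destroy (f);
            flux_kvs_txn_destroy (txn);
            return -1;
        }
        flux_kvs_txn_destroy (txn);                 /* ops are copied at commit */
        return 0;
    }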

@chu11 (Member, Author) commented Jun 29, 2022

post a regular commit with no ops (empty txn) and the FLUX_KVS_SYNC flag

I was just thinking about this. For some reason, I felt that a txn must not be empty. But if it is empty, that should totally be allowed. Then this would be easy and I can perhaps add some "state machine jumps" to skip states that are no longer important.

@chu11 (Member, Author) commented Jun 30, 2022

The spurious wakeup can be avoided by not starting the timer watcher until a commit comes in. E.g. each commit checks if the timer is active, and if not, sets the time and starts it. Timer expires, sync happens, and the system could be completely idle until another commit comes in and starts the timer again.

So what do we think the average "defensive checkpoint" time length is going to be? B/c I'd have to add a ctx->rank == 0, ctx->sync > 0.0, strcmp (namespace, PRIMARY) check on every commit (and probably a timer check too).

IMO, the extra checking on every commit is a bit much, although I didn't do any testing / measurement yet.

@garlick (Member) commented Jun 30, 2022

Could you just do (on every commit):

if (!timer_flag) {
    flux_timer_watcher_reset (w, sync_time, 0.);   /* one-shot: repeat = 0 */
    flux_watcher_start (w);
    timer_flag = true;
}

Then in the timer callback, start the sync and set timer_flag = false.

So you'd get approximately a sync every sync_time as long as there is activity.
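For completeness, a sketch of the matching timer callback under this scheme; sync_cb and timer_flag are shorthand here for whatever the KVS module context actually carries, and the sync step itself is elided:

    static void sync_cb (flux_reactor_t *r, flux_watcher_t *w,
                         int revents, void *arg)
    {
        struct kvs_ctx *ctx = arg;

        timer_flag = false;   /* allow the next commit to re-arm the timer */
        /* kick off the content flush + checkpoint ("sync") for ctx here */
    }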

If the sync time is less than some threshold we could always bump it up and document this. Say 1m?

@chu11 (Member, Author) commented Jun 30, 2022

Could you just do (on every commit):

You mean not do the checks I was referring to above (or just not do the strcmp for the primary namespace)? I suppose, although we'd probably be activating the timer way more often than we want to?

@garlick (Member) commented Jun 30, 2022

Oh yes, sorry, those checks would be needed. Commits arriving together are normally combined so maybe those checks would effectively be rate limited to the batch timeout? Can't imagine they would amount to much overhead in any case, but famous last words!

@chu11 chu11 force-pushed the issue4301_defensive_checkpoints branch from ae3fd87 to 826306a Compare June 30, 2022 22:25
@chu11 (Member, Author) commented Jun 30, 2022

re-pushed; per the discussion above, went with the "no ops transaction" method. Added a test for "no ops" b/c it wasn't actually covered, then added an optimization to skip several parts of the transaction state machine when there are no ops.

The sync timer is now started when a commit has just happened, and stopped after a recent sync completes if there are no more commits in the pipeline.

@garlick (Member) left a comment

Nice - a few minor comments for you.

Comment on lines 1249 to 1252
if (kvstxn_mgr_ready_transaction_count (root->ktm) == 0) {
flux_watcher_stop (ctx->sync_w);
ctx->sync_timer_running = false;
}
Member:

Would it be clearer to use a one-shot timer, and start it after a commit completes instead of when one is queued? Then you could just set the running flag false here unconditionally and avoid the kvsroot check.

Member Author:

Hmmmm, ok, I see your point: instead of the root check here and the queueing checks everywhere else, effectively do some timer math to see if the timer should be restarted after every commit. That might be simpler, lemme give it a try.

As an aside, I realized I do have a corner case, in that the current code guarantees to checkpoint within "sync time" after the first commit (or, if the KVS has been non-busy, the first commit in a while in which the timer is woken up), which is probably not correct. It should be "sync time" after the KVS started (or sync immediately if there hasn't been a sync in > "sync time").

Also, it occurred to me, to remove the wasteful "strcmp" on every namespace string, I could just add a special flag in the kvsroot struct to indicate "i am the primary namespace". I think that'd be useful.


if (flux_rpc_get (f, NULL) < 0)
flux_log_error (ctx->h, "%s: flux_rpc_get", __FUNCTION__);
flux_future_destroy (ctx->sync_f);
Member:

Might want to use just f in the destroy to be consistent with the flux_rpc_get() call.


/* N.B. no actual changes in this transaction */
if (!(txn = flux_kvs_txn_create ())) {
flux_log_error (ctx->h, "%s: flux_kvs_txn_create", __FUNCTION__);
Member:

The log messages in this function are all equally unlikely and don't provide much info. Maybe all the failure cases could use one better message like "periodic checkpoint setup failed: ..."

@chu11 chu11 force-pushed the issue4301_defensive_checkpoints branch from 826306a to 17f2b60 Compare July 1, 2022 18:04
@chu11 (Member, Author) commented Jul 1, 2022

re-pushed

  • re-did the sync logic, doing:
    • one shot timers
    • start timer after commit completes vs when commit queued up
    • fix logic to ensure timer launched earlier if necessary
  • add optimization commit to remove constant strcmp() on the KVS primary namespace name
  • add flux-config-kvs manpage, which i just completely forgot
  • minor cleanups per comments above

Comment on lines 924 to 928
if (t < ctx->sync)
after = ctx->sync - t;
/* else we haven't synced in a long time, so set timer
* to set off immediately (i.e. after = 0.0)
*/
Member:

This seems a bit weird to force a sync on the first commit after the system has been idle, when this commit is likely to be immediately followed by others if it's a job (the common case). Maybe we could describe the sync parameter as "the maximum length of time dirty KVS data will remain in memory before being flushed to the backing store" and then just set the timer to the fixed configured value?

That seems more intuitive to me.

Member Author:

Is the comment perhaps misleading? I initialize last_sync to when the module was loaded, so the first sync should occur "sync period" time after the first commit.

Member:

Maybe I'm confused, or maybe I wasn't clear - I thought t was tracking the time since the last sync and would shorten the timer according to how long it's been since the last sync. So for example, if the sync period is 30m and the system is idle for a day and then someone runs a job, the first commit would trigger a sync. But since nothing else has been committed, there really isn't any increased urgency to do the sync right then.

If instead we think about the sync time period as placing an upper bound on how long after a change we wait to checkpoint it, then we would just start the timer with a fixed duration whenever something changes.

Member Author:

ok, I see your point now. So instead of "checkpoint every X time", you're thinking it should be "checkpoint if anything is older than X time".

@chu11 chu11 force-pushed the issue4301_defensive_checkpoints branch from 17f2b60 to 0c58cae Compare July 1, 2022 23:10
@chu11 (Member, Author) commented Jul 1, 2022

re-pushed, tweaking the sync logic per @garlick's comments above. Tweaked the KVS manpage language as a result. Also fixed up some manpage generation breakage that I messed up.

flux_log_error (ctx->h, "%s: flux_rpc_get", __FUNCTION__);
flux_future_destroy (f);
ctx->sync_f = NULL;
ctx->sync_started = false;
Member:

Should this be set false in sync_cb()? If another commit completes before an in-progress sync does, then it wouldn't start the timer again, and could sit there until another commit comes along.

Member Author:

I don't think so. The timer could be restarted before the last sync completes. Here's the way I think about it. Here's the transaction queue where a bunch of commits come in and are put on the queue.

putA -> putB -> putC -> NULL

putA finishes, leading to the sync timer being started. At some point in the future sync_cb() will be called and will add a transaction to the end of the queue to do the sync. So perhaps it looks like this at that point (putB has completed before the sync was added):

putC -> putD -> putE -> sync -> NULL

if we set sync_started to false in sync_cb(), that means the timer can be restarted after putC completes, which is not what we want. We'd want the sync timer to be restarted on a putF that will come later after the sync.

But your comment does make me think, is it possible a putF could complete before the sync_started flag is set to false in the continuation? The odds are low, but I'm not sure if it could be guaranteed. There's a lot of reactor stuff going on in the background.

I suppose the safe thing is sync_started would maybe have to be set to false when the sync is simply completed, perhaps via some callback mechanism.

These corner cases are tough to wrap your head around, and I definitely had to ponder them a bit. Is it worth considering going back to something closer to the initial implementation I proposed, where it's simply a timer that fires every X time? It largely avoids thinking about a lot of these corner cases.

@garlick (Member) commented Jul 2, 2022 via email

@chu11 chu11 force-pushed the issue4301_defensive_checkpoints branch 2 times, most recently from 38cde07 to 8ff0952 Compare July 4, 2022 22:53
@chu11 (Member, Author) commented Jul 4, 2022

re-pushed, missed a "&&" chain in a test

src/modules/kvs/kvs.c (review thread resolved)
@@ -1053,6 +1072,7 @@ static void kvstxn_apply (kvstxn_t *kt)
}
setroot (ctx, root, kvstxn_get_newroot_ref (kt), root->seq + 1);
setroot_event_send (ctx, root, names, kvstxn_get_keys (kt));
start_sync_timer (ctx, root);
Member:

By chance is the timer re-armed here by the sync (empty commit) itself? I'm noting that even when the system should be idle I'm getting repeated syncs as though the timer were periodic.

Member Author:

hmm that makes sense, although I'm not sure how this isn't being caught by tests.

@chu11 (Member, Author) commented Jul 5, 2022

ahhh, my tests only ensure that the rootref hasn't changed, which doesn't mean syncs aren't attempted. I assume you're putting in printfs or something and notice the sync callback being called more times than it should?

We need another flag to ensure this doesn't happen, one for the sync timer and one for a sync being active.

But b/c of the racy thing I mentioned above (a commit could complete before sync_continuation() is called), I think this would involve more hackery than I'd like. Let's talk at coffee time about this.

This has certainly exceeded the complexity I originally imagined this PR to be, I'm not sure if you had other ideas as well.

Member:

OK. No coffee time today though (it's tuesday). Did you want to chat now? LMK when you're available.

The only race I can imagine is maybe if the sync commit took longer than a concurrent timer.

I didn't entirely understand how the empty commit is handled, but it would seem like it must be publishing the root if it's also starting the timer since that's right next to it in the code? It probably shouldn't be doing that either, so maybe it needs to short circuit out of there earlier?

@chu11 chu11 mentioned this pull request Jul 5, 2022
@chu11 (Member, Author) commented Jul 6, 2022

A side note: @garlick and I chatted and we realized perhaps the problem is that we are using FLUX_KVS_SYNC here. Effectively, we are trying to manage the "sync" from something that is sort of outside of the KVS transaction's purview.

Instead, I will make an attempt to implement it as a "feature" within the KVS transaction manager.

If that fails, we may fall back and just do a normal cron-like-timer.

@chu11 chu11 force-pushed the issue4301_defensive_checkpoints branch from 8ff0952 to 82723d5 Compare July 6, 2022 21:45
@chu11 (Member, Author) commented Jul 6, 2022

Re-pushed, altering the logic of how the syncs are done.

  1. In the KVS transaction manager, add a new function kvstxn_mgr_prepend_sync() that fake-prepends a "sync transaction" onto the head of the transaction queue. Why prepend instead of append? To ensure the "sync" is done ASAP, instead of at some unknown time in the future if there is a large queue of commits waiting to be processed.

This was important so we don't have to deal with the complexity from before of "what if there's a sync transaction already on the transaction queue". It's now the next transaction all the time, keeping this simpler.

We also add kvstxn_is_sync() as a helper function to determine if the transaction is a special "sync" one initiated by the KVS module.

  2. sync_cb() now calls kvstxn_mgr_prepend_sync() instead of creating a sync commit via flux_kvs_commit() (see the sketch below). Why is this important? B/c now we don't need to wait for a future to know the sync is complete. The KVS transaction code knows when a sync transaction finishes (using kvstxn_is_sync()), so we don't have to deal with the complexity of waiting for a future to complete.

  3. We also have the ability to identify when a special KVS module sync occurs, so that we don't have to perform a setroot() or setroot_event_send() after the transaction is complete. We know this via the kvstxn_is_sync() function.

  4. The timer is now activated whenever a commit completes, the sync timer is inactive, AND a sync is not in progress. The flags / logic for this is easier now that we don't need to deal with a future or with syncs at the back end of the queue.

  5. Put the cleanups in PR kvs: misc cleanups #4389, so hopefully this PR is easier to review as a result.
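As a rough illustration of item 2 (a sketch only - kvstxn_mgr_prepend_sync() and kvstxn_is_sync() are the new helpers described above, and their exact signatures in the PR may differ), the body of sync_cb() could look roughly like:

    /* timer fired: prepend the special sync transaction so it runs next,
     * rather than behind a long queue of pending commits */
    struct kvsroot *root = kvsroot_mgr_lookup_root (ctx->krm,
                                                    KVS_PRIMARY_NAMESPACE);
    if (!root || kvstxn_mgr_prepend_sync (root->ktm) < 0) {
        flux_log_error (ctx->h, "failed to prepend sync transaction");
        return;
    }
    /* later, in kvstxn_apply(): when kvstxn_is_sync (kt) is true, skip
     * setroot() and setroot_event_send() since nothing in the KVS changed */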

@chu11 chu11 force-pushed the issue4301_defensive_checkpoints branch from fe8e64e to 239a65b Compare July 12, 2022 15:47
@chu11 (Member, Author) commented Jul 12, 2022

rebased, re-pushed, squashed, fixing up a few of the things mentioned by @garlick above

@garlick (Member) left a comment

Here are a few comments.

I'm not sure if it's the right thing but I'll throw it out there: would it make sense to create a sync.c source file for this stuff, just to keep it contained?

@@ -37,13 +37,15 @@
#define KVSTXN_PROCESSING 0x01
#define KVSTXN_MERGED 0x02 /* kvstxn is a merger of transactions */
#define KVSTXN_MERGE_COMPONENT 0x04 /* kvstxn is member of a merger */
#define KVSTXN_SYNC 0x08 /* is special kvstxn sync */
@garlick (Member) commented Jul 12, 2022

It might be helpful if this flag was named something like KVSTXN_SYNC_ONLY or KVSTXN_SPECIAL_SYNC or something like that, so it won't look like an internal version of FLUX_KVS_SYNC.

Comment on lines 2873 to 2739
const char *errstr = "Failed to reconfigure kvs";
char errbuf[256];
if (flux_conf_reload_decode (msg, &conf) < 0)
goto error;
Member:

Suggest you just set errstr to NULL for reload_decode failure, since that failure is unlikely and "Failed to reconfigure kvs" would be a misleading error message if it ever did fail.

Comment on lines 2769 to 2772
snprintf (errbuf,
errbufsize,
"error reading config for kvs: %s",
err.text);
Member:

These days we have flux_error_t and errprintf(). Suggest you use those here.
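For reference, a sketch of the suggested pattern, with the message text taken from the snippet above (how the flux_error_t then flows back to the caller is omitted):

    flux_error_t error;

    errprintf (&error, "error reading config for kvs: %s", err.text);
    /* error.text now holds the formatted message */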

Member Author:

ahhh i happened to copy & paste from resource, perhaps i should update there too while I'm at it.

Comment on lines 1078 to 1081
/* If this transaction is the special internal sync
* transaction, no need to setroot or send event, we just
* checkpointed, so do nothing else */
if (!kvstxn_is_sync (kt)) {
Member:

Might want to flip the two legs of the if here to match the comment for clarity.

Comment on lines 2801 to 2815
/* setup timer if it wasn't done before */
if (sync > 0.0 && !ctx->sync_w) {
flux_reactor_t *r = flux_get_reactor (ctx->h);
if (!(ctx->sync_w = flux_timer_watcher_create (r,
sync,
0.0,
sync_cb,
ctx))) {
snprintf (errbuf,
errbufsize,
"error creating sync timer: %s",
flux_strerror (errno));
return -1;
}
}
Member:

It may be wise to create the timer watcher unconditionally in the kvs initialization so the module fails to load on failure, instead of needing to handle that error in the reconfig path when the system is running (and sending the error back to flux config reload).

Comment on lines 2835 to 2860
if (ctx->sync_timer_started) {
if (sync > 0.0) {
flux_watcher_stop (ctx->sync_w);
flux_timer_watcher_reset (ctx->sync_w, sync, 0.);
flux_watcher_start (ctx->sync_w);
}
else { /* sync == 0.0 */
flux_watcher_stop (ctx->sync_w);
ctx->sync_timer_started = false;
}
}
else {
if (sync > 0.0
&& ctx->sync == 0.0
&& ctx->primary_commit_since_last_sync) {
/* on rank 0, primary namespace has to be here */
struct kvsroot *root;
root = kvsroot_mgr_lookup_root (ctx->krm,
KVS_PRIMARY_NAMESPACE);
assert (root);
run_sync_timer (ctx);
}
}
}

ctx->sync = sync;
@garlick (Member) commented Jul 12, 2022

Could this be expressed more simply, and without the need for the big block comment, as

ctx->sync = sync;
flux_watcher_stop (ctx->sync_w);
ctx->sync_timer_started = false;
if (ctx->primary_commit_since_last_sync)
    start_sync_timer (ctx);

@chu11 (Member, Author) commented Jul 14, 2022

not sure I can simplify it that much, but you're right that it could be simplified. The above did have some dead code (kvsroot_mgr_lookup_root() looks up something that is not used). If I used start_sync_timer_if_needed(), I'd have to do the lookup.

I did clean it up to the following.

    if (ctx->rank == 0 && sync != ctx->sync) {
        ctx->sync = sync;
        if (!ctx->sync_txn_submitted) {
            if (ctx->sync_timer_started) {
                flux_watcher_stop (ctx->sync_w);
                if (ctx->sync > 0.0) {
                    flux_timer_watcher_reset (ctx->sync_w, ctx->sync, 0.);
                    flux_watcher_start (ctx->sync_w);
                }
                else /* ctx->sync == 0.0 */
                    ctx->sync_timer_started = false;
            }
            else {
                if (ctx->primary_commit_since_last_sync)
                    start_sync_timer (ctx); /* N.B. this was previously run_sync_timer() */
            }
        }
    }

char *name = NULL;

/* N.B. we "fake" prepend to the ready list by storing the sync
* transactino in a variable in the kvstxn_mgr_t. If we really
Member:

typo "transactino"


error:
kvstxn_destroy (kt);
free (name);
Member:

free() might clobber errno, and flux_log_error() is called when this function fails.

double sync; /* in seconds */
flux_watcher_t *sync_w;
bool sync_timer_started;
bool sync_txn_submitted;
Member:

It may add unnecessary complexity to have the sync_txn_submitted flag when you could just allow another sync to be submitted when one is in progress. It would do no harm, right?

Well I suppose if the sync time was very short it could go badly. Regardless, maybe it would be a good idea to have a lower bound on the configured sync time? Like say 1m?

Member Author:

ahh good catch, i don't allow two "sync prepends" to the kvstxn queue, so it literally would do nothing if a second were attempted to be added by the user.

@chu11 (Member, Author) commented Jul 14, 2022

oh wait ... this was there to catch a corner case. A "sync prepend" may occur while a commit is in progress, so the checkpoint would occur right after the current commit in progress is complete.

If we removed this flag, the sync timer could go off for the commit right before the checkpoint is done. I guess we could allow it? But it seems sort of silly to me to allow this. We'd want the sync timer to start for the commit right after the sync/checkpoint.

Member:

I'm having a hard time following this, and that's not a good indication that someone coming along later will be able to quickly understand it enough to make changes. I'm trying to think what would cordon this feature off from the rest of the complex kvs stuff, so it could be grokked in isolation.

What if we just create a new source module sync.c and each time we complete a transaction on the root namespace, we call a function like sync_setroot(). Then in that module (without hooking into any other part of the KVS except maybe a configuration callback) we implement the content flush + checkpoint and timer? Maybe sync.c could contain functions that are also used by the FLUX_KVS_SYNC flag handling in the commit code?

Too much of a do-over?

Member Author:

lemme try to split out into another file, see if that will make things clearer.

Member Author:

another thought: would it be worthwhile to go back to the earlier timer-based method? Simply wake up every X time and perform a sync using FLUX_KVS_SYNC? So much of the complexity is eliminated, but at the cost of an occasional timer wakeup and the setroot event that goes out every once in a while.

Member:

It seems wrong to broadcast the setroot event when the KVS content has not changed.

Doing this standalone really shouldn't be that bad - in fact, couldn't it just subscribe to the setroot event and use that to start the timer?

ctx->sync_timer_started = true;
}

static void start_sync_timer (struct kvs_ctx *ctx, struct kvsroot *root)
Member:

The names of these two functions are a bit misleading since neither of them "run" the timer. They both only "start" the timer, one conditionally. How about:

  • start_sync_timer()
  • start_sync_timer_if_needed()

or similar?

@chu11 (Member, Author) commented Jul 15, 2022

talked to @garlick, and we both agree that this has become more complicated than either of us initially imagined. So we're going to backtrack. We're going to go with the timer loop idea that I originally proposed, but we'll create a special internal flag like KVSTXN_NO_PUBLISH or something like that, so it will not publish the setroot event after a transaction has completed.

we'll come back to this someday if this ends up being a performance issue, which we both agree is unlikely if the user sticks to sensible checkpoint periods (like 30 minutes), but can be a problem if the period goes down (which is where a lot of the complicated corner cases were coming in).

@chu11 chu11 force-pushed the issue4301_defensive_checkpoints branch from 239a65b to 11ba858 Compare July 18, 2022 16:49
@chu11 (Member, Author) commented Jul 18, 2022

rebased & re-pushed

  • renamed "sync" config to "checkpoint-period"
  • went back to timer loop mechanism, wakeup once every X configured time
  • support a NO_PUBLISH flag in kvstxn to avoid sending out setroot events after an internal checkpoint is done.
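A rough sketch of what the timer-loop callback looks like under this design; the structure and field names follow the snippets quoted in the review below, but the kvstxn_mgr_add_transaction() call (its flags and the extra internal-flags argument) is abbreviated and may not match the final code exactly:

    static void checkpoint_period_cb (flux_reactor_t *r,
                                      flux_watcher_t *w,
                                      int revents,
                                      void *arg)
    {
        kvs_checkpoint_t *kcp = arg;
        char name[64];
        json_t *ops;

        /* nothing changed since the last checkpoint: this wakeup is a no-op */
        if (kcp->last_checkpoint_seq == kcp->root_primary->seq)
            return;

        snprintf (name, sizeof (name), "checkpoint-period.%u",
                  kcp->root_primary->seq);
        /* queue an empty transaction that flushes + checkpoints
         * (FLUX_KVS_SYNC) but does not publish a setroot event
         * (KVSTXN_INTERNAL_FLAG_NO_PUBLISH) */
        if (!(ops = json_array ())
            || kvstxn_mgr_add_transaction (kcp->root_primary->ktm,
                                           name,
                                           ops,
                                           FLUX_KVS_SYNC,
                                           KVSTXN_INTERNAL_FLAG_NO_PUBLISH) < 0)
            flux_log_error (kcp->h, "checkpoint-period transaction failed");
        json_decref (ops);
    }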

@@ -1239,6 +1245,60 @@ static void heartbeat_sync_cb (flux_future_t *f, void *arg)
flux_future_reset (f);
}

static void checkpoint_period_cb (flux_reactor_t *r,
Member:

Any reason this stuff was not moved to a new source file? It seems like it could be fairly self-contained and kvs.c is already really long.

Member Author:

it seemed just below the threshold for a new file, but it's close. I'll splice out into a new file.

@chu11 chu11 force-pushed the issue4301_defensive_checkpoints branch from 11ba858 to 6c64cb6 Compare July 22, 2022 13:55
@chu11 (Member, Author) commented Jul 22, 2022

re-pushed, splicing all the checkpoint-period stuff into its own file. There were two tricky parts:

  1. the need to update the "work queue" of roots with work in them. So I had to add a callback function to deal with that.

  2. how to deal with the fact that this only matters on rank == 0 without sticking a ton of if (ctx->rank == 0) checks everywhere.

I just made it so the checkpoint-period "struct" is set to NULL on rank != 0, and all of the checkpoint-period functions do nothing when the context pointer is NULL.
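A sketch of the NULL-guard convention being described; the function and field names here are illustrative, only the pattern matters:

    /* every kvs_checkpoint_* entry point is a no-op when kcp is NULL,
     * i.e. on any rank other than 0 where the object was never created */
    void kvs_checkpoint_config_set (kvs_checkpoint_t *kcp,
                                    double checkpoint_period)
    {
        if (!kcp)
            return;
        kcp->checkpoint_period = checkpoint_period;
    }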

@garlick (Member) left a comment

This looks great @chu11. I think breaking the functionality out to its own source file was a good move. At least for a relative outsider to the code base, it's way easier to grok when it can be isolated.

I made some suggestions but they are all pretty superficial at this point.

Oh I'll set this up on my test system and give it a quick "real life" test.

@@ -55,7 +55,8 @@ struct kvstxn {
json_t *ops;
json_t *keys;
json_t *names;
int flags;
int flags; /* kvs flags from request caller */
int api_flags; /* special kvstxn api flags */
Member:

I'm not sure the api_ prefix is communicating that it's an internal flag. Would internal_flags be better?

Member Author:

i wasn't the biggest fan of "api_flags" either. The issue was there is already an "internal_flags" used internally within the kvstxn api (internal means not externally exposed to users).

perhaps it could be called kvstxn_flags? or perhaps the other flags field needs to be renamed to op_flags? lemme think about it.

@chu11 (Member, Author) commented Jul 25, 2022

ahh, I had a good idea: the "internal_flags" didn't really have to be bitmasks, they were just there as "bools" for certain conditions. So I replaced them with a bunch of bools. Then api_flags could become "internal_flags" instead.

Comment on lines 155 to 161
/* ctx->kcp will not be created on rank != 0, all kvs_checkpoint
* functions will do nothing when ctx->kcp is NULL.
*/
ctx->kcp = kvs_checkpoint_create (h,
NULL, /* set later */
0.0, /* default 0.0, set later */
work_queue_check_append_wrapper,
ctx);
if (!ctx->kcp)
goto error;
Member:

Since this code is inside an if (ctx->rank == 0) block, maybe the comment is not needed here? Might be better to just document in kvs_checkpoint.h that each function is a no-op if called with a NULL kvs_checkpoint_t argument, to avoid excessive rank 0 tests in the code.

@@ -2703,6 +2727,30 @@ static void setroot_unpause_request_cb (flux_t *h, flux_msg_handler_t *mh,
flux_log_error (h, "%s: flux_respond_error", __FUNCTION__);
}

static void reload_cb (flux_t *h,
Member:

maybe naming this config_reload_cb() would be better in such a large code base as the kvs?

if (kcp->last_checkpoint_seq == kcp->root_primary->seq)
return;

if (asprintf (&name, "checkpoint-period.%u", kcp->root_primary->seq) < 0
Member:

It would probably be better to use a static buffer instead of asprintf() here since the string is of predictable length.
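For example, a sketch of the static-buffer form (64 bytes comfortably covers the "checkpoint-period." prefix plus a 32-bit sequence number):

    char name[64];
    (void)snprintf (name, sizeof (name), "checkpoint-period.%u",
                    kcp->root_primary->seq);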

Comment on lines 156 to 160
free (name);
json_decref (ops);
Member:

Clobbers errno

Member Author:

does it matter in this case? the function returns void.

Member:

Oh sorry, I thought I was following a different call stack. This is the timer watcher callback (duh).

Given that the function cannot return an error, maybe it would make sense to rename the goto label to "done".

kvs_checkpoint_txn_cb txn_cb,
void *txn_cb_arg);

/* will update internal checkpoint_period setting as needed */
Member:

I think you can drop "will" in all the comments in this header file.

Comment on lines 44 to 47
cat >kvs.toml <<EOF &&
[kvs]
checkpoint-period = "1Z"
EOF
Member:

Suggestion: use <<-EOF and then the here-doc content can be indented with tab characters.

@garlick (Member) commented Jul 25, 2022

I set my test system up for 30m checkpoints, put something in the kvs, then after an hour ran flux kvs dump --checkpoint - | tar tvf - and there was my test key. Yay!

@chu11 chu11 force-pushed the issue4301_defensive_checkpoints branch 2 times, most recently from bb2d01a to 0e32da4 Compare July 25, 2022 21:48
@chu11 (Member, Author) commented Jul 25, 2022

re-based and re-pushed with fixes per the comments above. Perhaps you want to give it one more skim @garlick? The big change is making the previously named internal_flags into three separate bools:

-    int internal_flags;
+    bool processing;            /* kvstxn is being processed */
+    bool merged;                /* kvstxn is a merger of transactions */
+    bool merge_component;       /* kvstxn is member of a merger */

and the previously named api_flags is now internal_flags.

@garlick (Member) left a comment

Looks good after a final skim!

chu11 added 7 commits July 25, 2022 15:59
Problem: The kvsroot.h header defines the kvsroot struct, which
includes a zlist_t member.  However, it does not include
the appropriate headers to ensure this type is defined.

Solution: Add include of czmq_containers.h into kvsroot.h.
Problem: In the kvstxn code, an internal bitmask of flags is used
for "internal" purposes.  The use of a bitmask for internal purposes
can be confused with other flag parameters that may be passed in
by users.

Solution: Remove the "internal_flags" and replace with bools representing
the conditions the internal code cares about.
Problem: In the future we'd like to add some special options
to transactions in the kvstxn API.  However, nothing in the
api currently supports this.

Solution: Add a new "internal_flags" parameter to kvstxn_mgr_add_transaction()
which will allow users to set special kvstxn flags.  Add a new
kvstxn_get_internal_flags() to extract those flags from a transaction.

Note that no flags are currently supported.
Problem: We would like to mark some transactions to indicate they
should not publish their changes after completion.

Solution: Add a new internal flag called KVSTXN_INTERNAL_FLAG_NO_PUBLISH.
If set, do not call setroot() or setroot_event_send().
Problem: While there are mechanisms to checkpoint the
primary KVS namespace via commands (such as the flux kvs
--sync option), there is no configurable way to do it at
timed intervals.

Solution: Add a "checkpoint-period" KVS configuration option.
The configuration option takes a fsd time length as input and
will checkpoint the present KVS root at each time interval.

Fixes flux-framework#4301
Problem: No documentation for the new [kvs] config options
exists.

Solution: Add a flux-config-kvs man page.
@chu11 chu11 force-pushed the issue4301_defensive_checkpoints branch from 0e32da4 to 882a320 Compare July 25, 2022 23:00
codecov bot commented Jul 25, 2022

Codecov Report

Merging #4383 (7e17fdb) into master (7e17fdb) will not change coverage.
The diff coverage is n/a.

❗ Current head 7e17fdb differs from pull request most recent head 882a320. Consider uploading reports for the commit 882a320 to get more accurate results

@@           Coverage Diff           @@
##           master    #4383   +/-   ##
=======================================
  Coverage   83.32%   83.32%           
=======================================
  Files         395      395           
  Lines       66652    66652           
=======================================
  Hits        55535    55535           
  Misses      11117    11117           

@mergify mergify bot merged commit 007b4a6 into flux-framework:master Jul 26, 2022
@chu11 chu11 deleted the issue4301_defensive_checkpoints branch July 26, 2022 04:17