
kvs: add defensive checkpoint via sync configuration option #4383

Merged · 7 commits · Jul 26, 2022

Conversation

@chu11 (Member) commented Jun 29, 2022

Per discussion in #4301, add a sync = <time> option to the KVS, so that the primary kvs namespace can be checkpointed on a regular basis.

@garlick (Member) commented Jun 29, 2022

On a quick skim, it looks like the timer wakes up even when there has been no kvs activity, and also that it commits actual data to the KVS each time it syncs. I don't think either of those things are desirable or necessary.

Maybe a timer is the wrong approach - could the commit logic check how long it's been since the last sync and just do one if periodic sync is configured AND it's been long enough?

@chu11 (Member, Author) commented Jun 29, 2022

On a quick skim, it looks like the timer wakes up even when there has been no kvs activity, and also that it commits actual data to the KVS each time it syncs. I don't think either of those things are desirable or necessary.

Yup, it does wake up if there hasn't been any activity, but it maintains the last sequence number of the root so that it won't do another commit unnecessarily.

    /* if no changes to root since last sync, don't do it */
    if (ctx->sync_last_rootseq == root->seq)
        return;

Maybe a timer is the wrong approach - could the commit logic check how long it's been since the last sync and just do one if periodic sync is configured AND it's been long enough?

Hmmm, that's an interesting idea. Perhaps if there hasn't been a checkpoint in a while, just add FLUX_KVS_SYNC to that particular commit? It's worth trying out to see if it's cleaner.

@garlick (Member) commented Jun 29, 2022

It's probably a non starter to have the code periodically commit something, especially something different each time since that contributes to content.sqlite growth.

@chu11 (Member, Author) commented Jun 29, 2022

especially something different each time since that contributes to content.sqlite growth.

Ahhh that's a good point. Can definitely write the same thing over and over again instead.

It's probably a non starter to have the code periodically commit something

As I was trying to experiment with this idea, it occurred to me that adding a FLUX_KVS_SYNC to a kvs transaction probably isn't a good idea. We wouldn't want an error on content.flush or checkpoint to lead to a commit error, especially for a KVS transaction the kvs module doesn't "own". We could add "back out" logic if the sync fails, but things start to get complicated.

That leads to the idea of using the timer technique but having it just do a content.flush and checkpoint, independent of the commit logic. However, I realized in a prior prototype that this would be racy with anyone using the FLUX_KVS_SYNC flag. By doing a commit with the FLUX_KVS_SYNC flag instead, commits aren't racy against other attempts to sync.

I'll continue to ponder, maybe there's another idea I'm not thinking about yet.

@garlick (Member) commented Jun 29, 2022

If a sync fails, something is really going wrong with the system or the backing store, and KVS data is in jeopardy with no real recovery options, so it probably would be fine IMHO to fail a commit in that case.

@chu11 (Member, Author) commented Jun 29, 2022

Hmmm, here's an idea. What if we implemented a hypothetical FLUX_KVS_SYNC_NO_COMMIT flag (this flag could potentially be internal to KVS only). It's a KVS transaction that has no operations, it just performs the flush & checkpoint. I don't think it'd be too hard b/c it'd just be a tweak to the kvstxn state machine to skip all of those states in the middle (although this could be famous last words).

If we keep the timer approach, it's not that different from what is in this PR once the above is implemented.

If we go with the "check how recent the last commit was and if it's > sync time ago" approach, we don't have to alter the current transaction. Just add a new KVS transaction.

FWIW, I'm currently less than enthused with the latter approach. It would add a ctx->rank == 0 && strcmp (namespace, KVS_PRIMARY) == 0 && ctx->sync > 0.0 check, plus a clock_getmonotime() call if true, into a pretty critical section. It might be a net loss in performance compared to the timer.

@garlick (Member) commented Jun 29, 2022

Oh, duh - the way I proposed (not using a timer) could leave dirty data unsynced longer than the configured time if another commit doesn't come along. Sorry about that, your intuition was right that a timer is the way to go!

The spurious wakeup can be avoided by not starting the timer watcher until a commit comes in. E.g. each commit checks if the timer is active, and if not, sets the time and starts it. Timer expires, sync happens, and the system could be completely idle until another commit comes in and starts the timer again.

You're the expert on the best way to do this inside the KVS, so take this with a grain of salt but instead of a flag could we just have the timer post a regular commit with no ops (empty txn) and the FLUX_KVS_SYNC flag, and then just ensure that the state machine does the right thing with that if it doesn't already? It seems like it would be nice to allow that anyway, from outside the KVS module.
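For reference, a minimal sketch of what such an empty sync-only commit could look like, assuming the standard libkvs calls (flux_kvs_txn_create(), flux_kvs_commit(), flux_future_then()); submit_sync_commit() and sync_continuation() are placeholder names, and the premise that an empty txn plus FLUX_KVS_SYNC flushes and checkpoints without adding content is exactly what is being proposed here, not established behavior:

    static int submit_sync_commit (struct kvs_ctx *ctx)
    {
        flux_kvs_txn_t *txn;
        flux_future_t *f;

        if (!(txn = flux_kvs_txn_create ()))        /* empty txn, no ops */
            return -1;
        if (!(f = flux_kvs_commit (ctx->h, KVS_PRIMARY_NAMESPACE,
                                   FLUX_KVS_SYNC, txn))
            || flux_future_then (f, -1., sync_continuation, ctx) < 0) {
            flux_future_destroy (f);
            flux_kvs_txn_destroy (txn);
            return -1;
        }
        flux_kvs_txn_destroy (txn);                 /* ops are copied at commit */
        return 0;
    }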

@chu11 (Member, Author) commented Jun 29, 2022

post a regular commit with no ops (empty txn) and the FLUX_KVS_SYNC flag

I was just thinking about this. For some reason, I felt that a txn must not be empty. But if it is empty, that should totally be allowed. Then this would be easy and I can perhaps add some "state machine jumps" to skip states that are no longer important.

@chu11 (Member, Author) commented Jun 30, 2022

The spurious wakeup can be avoided by not starting the timer watcher until a commit comes in. E.g. each commit checks if the timer is active, and if not, sets the time and starts it. Timer expires, sync happens, and the system could be completely idle until another commit comes in and starts the timer again.

So what do we think the average "defensive checkpoint" time length is going to be? B/c I'd have to add a ctx->rank == 0, ctx->sync > 0.0, strcmp (namespace, PRIMARY) check on every commit (and probably a timer check too).

IMO, the extra checking on every commit is a bit much, although I didn't do any testing / measurement yet.

@garlick (Member) commented Jun 30, 2022

Could you just do (on every commit):

if (!timer_flag) {
    flux_timer_watcher_reset (w, sync_time, 0.);   /* one-shot: repeat = 0 */
    flux_watcher_start (w);
    timer_flag = true;
}

Then in the timer callback, start the sync and set timer_flag = false.

So you'd get approximately a sync every sync_time as long as there is activity.
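For completeness, a sketch of the matching timer callback under this scheme; sync_cb and timer_flag are shorthand here for whatever the KVS module context actually carries, and the sync step itself is elided:

    static void sync_cb (flux_reactor_t *r, flux_watcher_t *w,
                         int revents, void *arg)
    {
        struct kvs_ctx *ctx = arg;

        timer_flag = false;   /* allow the next commit to re-arm the timer */
        /* kick off the content flush + checkpoint ("sync") for ctx here */
    }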

If the sync time is less than some threshold we could always bump it up and document this. Say 1m?

@chu11 (Member, Author) commented Jun 30, 2022

Could you just do (on every commit):

You mean not do the checks I was referring to above (or just not do the strcmp for the primary namespace)? I suppose, although we'd probably be activating the timer way more often than we want to?

@garlick (Member) commented Jun 30, 2022

Oh yes, sorry, those checks would be needed. Commits arriving together are normally combined so maybe those checks would effectively be rate limited to the batch timeout? Can't imagine they would amount to much overhead in any case, but famous last words!

@chu11 chu11 force-pushed the issue4301_defensive_checkpoints branch from ae3fd87 to 826306a Compare June 30, 2022 22:25
@chu11 (Member, Author) commented Jun 30, 2022

re-pushed; per the discussion above, went with the "no ops transaction" method. Added a test for "no ops" b/c it wasn't actually covered, then added an optimization to skip several parts of the transaction state machine when there are no ops.

The sync timer is now started when a commit has just happened, and stopped after a recent sync completes if there are no more commits in the pipeline.

@garlick (Member) left a comment

Nice - a few minor comments for you.

Comment on lines 1249 to 1252
if (kvstxn_mgr_ready_transaction_count (root->ktm) == 0) {
flux_watcher_stop (ctx->sync_w);
ctx->sync_timer_running = false;
}
Member:

Would it be clearer to use a one-shot timer, and start it after a commit completes instead of when one is queued? Then you could just set the running flag false here unconditionally and avoid the kvsroot check.

Member Author:

Hmmmm, ok, I see your point: instead of the root check here and the queueing checks everywhere else, effectively do some timer math to see if the timer should be restarted after every commit. That might be simpler, lemme give it a try.

As an aside, I realized I do have a corner case, in that the current code guarantees to checkpoint within "sync time" after the first commit (or, if the KVS has been non-busy, the first commit in a while in which the timer is woken up), which is probably not correct. It should be "sync time" after the KVS started (or sync immediately if there hasn't been a sync in > "sync time").

Also, it occurred to me, to remove the wasteful "strcmp" on every namespace string, I could just add a special flag in the kvsroot struct to indicate "i am the primary namespace". I think that'd be useful.


if (flux_rpc_get (f, NULL) < 0)
flux_log_error (ctx->h, "%s: flux_rpc_get", __FUNCTION__);
flux_future_destroy (ctx->sync_f);
Member:

Might want to use just f in the destroy to be consistent with the flux_rpc_get() call.


/* N.B. no actual changes in this transaction */
if (!(txn = flux_kvs_txn_create ())) {
flux_log_error (ctx->h, "%s: flux_kvs_txn_create", __FUNCTION__);
Member:

The log messages in this function are all equally unlikely and don't provide much info. Maybe all the failure cases could use one better message like "periodic checkpoint setup failed: ..."

@chu11 chu11 force-pushed the issue4301_defensive_checkpoints branch from 826306a to 17f2b60 Compare July 1, 2022 18:04
@chu11 (Member, Author) commented Jul 1, 2022

re-pushed

  • re-did the sync logic, doing:
    • one shot timers
    • start timer after commit completes vs when commit queued up
    • fix logic to ensure timer launched earlier if necessary
  • add optimization commit to remove constant strcmp() on the KVS primary namespace name
  • add flux-config-kvs manpage, which i just completely forgot
  • minor cleanups per comments above

Comment on lines 924 to 928
if (t < ctx->sync)
after = ctx->sync - t;
/* else we haven't synced in a long time, so set timer
* to set off immediately (i.e. after = 0.0)
*/
Member:

This seems a bit weird to force a sync on the first commit after the system has been idle, when this commit is likely to be immediately followed by others if it's a job (the common case). Maybe we could describe the sync parameter as "the maximum length of time dirty KVS data will remain in memory before being flushed to the backing store" and then just set the timer to the fixed configured value?

That seems more intuitive to me.

Member Author:

Is the comment perhaps misleading? I initialize last_sync to when the module was loaded, so the first sync should occur "sync period" time after the first commit.

Member:

Maybe I'm confused, or maybe I wasn't clear - I thought t was tracking the time since the last sync and would shorten the timer according to how long it's been since the last sync. So for example, if the sync period is 30m and the system is idle for a day and then someone runs a job, the first commit would trigger a sync. But since nothing else has been committed, there really isn't any increased urgency to do the sync right then.

If instead we think about the sync time period as placing an upper bound on how long after a change we wait to checkpoint it, then we would just start the timer with a fixed duration whenever something changes.

Member Author:

ok, I see your point now. So instead of "checkpoint every X time", you're thinking it should be "checkpoint if anything is older than X time".

@chu11 chu11 force-pushed the issue4301_defensive_checkpoints branch from 17f2b60 to 0c58cae Compare July 1, 2022 23:10
@chu11 (Member, Author) commented Jul 1, 2022

re-pushed, tweaking the sync logic per @garlick's comments above. Tweaked the KVS manpage language as a result. Also fixed up some manpage generation breakage that I messed up.

flux_log_error (ctx->h, "%s: flux_rpc_get", __FUNCTION__);
flux_future_destroy (f);
ctx->sync_f = NULL;
ctx->sync_started = false;
Member:

Should this be set false in sync_cb()? If another commit completes before an in-progress sync does, then it wouldn't start the timer again, and could sit there until another commit comes along.

Member Author:

I don't think so. The timer could be restarted before the last sync completes. Here's the way I think about it. Here's the transaction queue where a bunch of commits come in and are put on the queue.

putA -> putB -> putC -> NULL

putA finishes, leading to the sync timer being started. At some point in the future sync_cb() will be called and will add a transaction to the end of the queue to do the sync. So perhaps it looks like this at that point (putB has completed before the sync was added):

putC -> putD -> putE -> sync -> NULL

if we set sync_started to false in sync_cb(), that means the timer can be restarted after putC completes, which is not what we want. We'd want the sync timer to be restarted on a putF that will come later after the sync.

But your comment does make me think, is it possible a putF could complete before the sync_started flag is set to false in the continuation? The odds are low, but I'm not sure if it could be guaranteed. There's a lot of reactor stuff going on in the background.

I suppose the safe thing is sync_started would maybe have to be set to false when the sync is simply completed, perhaps via some callback mechanism.

These corner cases are tough to wrap your head around, and I definitely had to ponder them a bit. Is it worth considering going back to something closer to the initial implementation I proposed, where it's simply a timer that fires every X time? It largely avoids thinking about a lot of these corner cases.

@garlick (Member) commented Jul 2, 2022 via email

@chu11 chu11 force-pushed the issue4301_defensive_checkpoints branch 2 times, most recently from 38cde07 to 8ff0952 Compare July 4, 2022 22:53
@chu11 (Member, Author) commented Jul 4, 2022

re-pushed, missed a "&&" chain in a test

src/modules/kvs/kvs.c (review thread resolved)
@@ -1053,6 +1072,7 @@ static void kvstxn_apply (kvstxn_t *kt)
}
setroot (ctx, root, kvstxn_get_newroot_ref (kt), root->seq + 1);
setroot_event_send (ctx, root, names, kvstxn_get_keys (kt));
start_sync_timer (ctx, root);
Member:

By chance is the timer re-armed here by the sync (empty commit) itself? I'm noting that even when the system should be idle I'm getting repeated syncs as though the timer were periodic.

Member Author:

hmm that makes sense, although I'm not sure how this isn't being caught by tests.

@chu11 (Member, Author) commented Jul 5, 2022

ahhh, my tests only ensure that the rootref hasn't changed, which doesn't mean syncs aren't attempted. I assume you're putting in printfs or something and notice the sync callback being called more times than it should?

We need another flag to ensure this doesn't happen, one for the sync timer and one for a sync being active.

But b/c of the racy thing I mentioned above (a commit could complete before sync_continuation() is called), I think this would involve more hackery than I'd like. Let's talk at coffee time about this.

This has certainly exceeded the complexity I originally imagined this PR to be, I'm not sure if you had other ideas as well.

Member:

OK. No coffee time today though (it's tuesday). Did you want to chat now? LMK when you're available.

The only race I can imagine is maybe if the sync commit took longer than a concurrent timer.

I didn't entirely understand how the empty commit is handled, but it would seem like it must be publishing the root if it's also starting the timer since that's right next to it in the code? It probably shouldn't be doing that either, so maybe it needs to short circuit out of there earlier?

@chu11 chu11 mentioned this pull request Jul 5, 2022
@chu11 (Member, Author) commented Jul 6, 2022

A side note: @garlick and I chatted and we realized perhaps the problem is that we are using FLUX_KVS_SYNC here. Effectively, we are trying to manage the "sync" from something that is sort of outside of the KVS transaction's purview.

Instead, I will make an attempt to implement it as a "feature" within the KVS transaction manager.

If that fails, we may fall back and just do a normal cron-like-timer.

@chu11 chu11 force-pushed the issue4301_defensive_checkpoints branch from 8ff0952 to 82723d5 Compare July 6, 2022 21:45
@chu11 (Member, Author) commented Jul 6, 2022

Re-pushed, altering the logic of how the syncs are done.

  1. In the KVS transaction manager, add a new function kvstxn_mgr_prepend_sync() that fake-prepends a "sync transaction" onto the head of the transaction queue. Why prepend instead of append? To ensure the "sync" is done ASAP, instead of at some unknown time in the future if there is a large queue of commits waiting to be processed.

This was important so we don't have to deal with the complexity from before of "what if there's a sync transaction already on the transaction queue". It's now the next transaction all the time, keeping this simpler.

We also add kvstxn_is_sync() as a helper function to determine if the transaction is a special "sync" one initiated by the KVS module.

  2. sync_cb() now calls kvstxn_mgr_prepend_sync() instead of creating a sync commit via flux_kvs_commit() (see the sketch below). Why is this important? B/c now we don't need to wait for a future to know the sync is complete. The KVS transaction code knows when a sync transaction finishes (using kvstxn_is_sync()), so we don't have to deal with the complexity of waiting for a future to complete.

  3. We also have the ability to identify when a special KVS module sync occurs, so that we don't have to perform a setroot() or setroot_event_send() after the transaction is complete. We know this via the kvstxn_is_sync() function.

  4. The timer is now activated whenever a commit completes, the sync timer is inactive, AND a sync is not in progress. The flags / logic for this is easier now that we don't need to deal with a future or with syncs at the back end of the queue.

  5. Put the cleanups in PR kvs: misc cleanups #4389, so hopefully this PR is easier to review as a result.
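As a rough illustration of item 2 (a sketch only - kvstxn_mgr_prepend_sync() and kvstxn_is_sync() are the new helpers described above, and their exact signatures in the PR may differ), the body of sync_cb() could look roughly like:

    /* timer fired: prepend the special sync transaction so it runs next,
     * rather than behind a long queue of pending commits */
    struct kvsroot *root = kvsroot_mgr_lookup_root (ctx->krm,
                                                    KVS_PRIMARY_NAMESPACE);
    if (!root || kvstxn_mgr_prepend_sync (root->ktm) < 0) {
        flux_log_error (ctx->h, "failed to prepend sync transaction");
        return;
    }
    /* later, in kvstxn_apply(): when kvstxn_is_sync (kt) is true, skip
     * setroot() and setroot_event_send() since nothing in the KVS changed */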

@chu11 chu11 force-pushed the issue4301_defensive_checkpoints branch from fe8e64e to 239a65b Compare July 12, 2022 15:47
@chu11 (Member, Author) commented Jul 12, 2022

rebased, re-pushed, squashed, fixing up a few of the things mentioned by @garlick above

@garlick (Member) left a comment

Here are a few comments.

I'm not sure if it's the right thing but I'll throw it out there: would it make sense to create a sync.c source file for this stuff, just to keep it contained?

@@ -37,13 +37,15 @@
#define KVSTXN_PROCESSING 0x01
#define KVSTXN_MERGED 0x02 /* kvstxn is a merger of transactions */
#define KVSTXN_MERGE_COMPONENT 0x04 /* kvstxn is member of a merger */
#define KVSTXN_SYNC 0x08 /* is special kvstxn sync */
@garlick (Member) commented Jul 12, 2022

It might be helpful if this flag was named something like KVSTXN_SYNC_ONLY or KVSTXN_SPECIAL_SYNC or something like that, so it won't look like an internal version of FLUX_KVS_SYNC.

Comment on lines 2873 to 2739
const char *errstr = "Failed to reconfigure kvs";
char errbuf[256];
if (flux_conf_reload_decode (msg, &conf) < 0)
goto error;
Member:

Suggest you just set errstr to NULL for reload_decode failure, since that failure is unlikely and "Failed to reconfigure kvs" would be a misleading error message if it ever did fail.

Comment on lines 2769 to 2772
snprintf (errbuf,
errbufsize,
"error reading config for kvs: %s",
err.text);
Member:

These days we have flux_error_t and errprintf(). Suggest you use those here.
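For reference, a sketch of the suggested pattern, with the message text taken from the snippet above (how the flux_error_t then flows back to the caller is omitted):

    flux_error_t error;

    errprintf (&error, "error reading config for kvs: %s", err.text);
    /* error.text now holds the formatted message */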

Member Author:

ahhh i happened to copy & paste from resource, perhaps i should update there too while I'm at it.

Comment on lines 1078 to 1081
/* If this transaction is the special internal sync
* transaction, no need to setroot or send event, we just
* checkpointed, so do nothing else */
if (!kvstxn_is_sync (kt)) {
Member:

Might want to flip the two legs of the if here to match the comment for clarity.

Comment on lines 2801 to 2815
/* setup timer if it wasn't done before */
if (sync > 0.0 && !ctx->sync_w) {
flux_reactor_t *r = flux_get_reactor (ctx->h);
if (!(ctx->sync_w = flux_timer_watcher_create (r,
sync,
0.0,
sync_cb,
ctx))) {
snprintf (errbuf,
errbufsize,
"error creating sync timer: %s",
flux_strerror (errno));
return -1;
}
}
Member:

It may be wise to create the timer watcher unconditionally in the kvs initialization so the module fails to load on failure, instead of needing to handle that error in the reconfig path when the system is running (and sending the error back to flux config reload).

Comment on lines 2835 to 2860
if (ctx->sync_timer_started) {
if (sync > 0.0) {
flux_watcher_stop (ctx->sync_w);
flux_timer_watcher_reset (ctx->sync_w, sync, 0.);
flux_watcher_start (ctx->sync_w);
}
else { /* sync == 0.0 */
flux_watcher_stop (ctx->sync_w);
ctx->sync_timer_started = false;
}
}
else {
if (sync > 0.0
&& ctx->sync == 0.0
&& ctx->primary_commit_since_last_sync) {
/* on rank 0, primary namespace has to be here */
struct kvsroot *root;
root = kvsroot_mgr_lookup_root (ctx->krm,
KVS_PRIMARY_NAMESPACE);
assert (root);
run_sync_timer (ctx);
}
}
}

ctx->sync = sync;
@garlick (Member) commented Jul 12, 2022

Could this be expressed more simply, and without the need for the big block comment, as

ctx->sync = sync;
flux_watcher_stop (ctx->sync_w);
ctx->sync_timer_started = false;
if (ctx->primary_commit_since_last_sync)
    start_sync_timer (ctx);

@chu11 (Member, Author) commented Jul 14, 2022

not sure I can simplify it that much, but you're right that it could be simplified. The above did have some dead code (kvsroot_mgr_lookup_root() looks up something that is not used). If I used start_sync_timer_if_needed(), I'd have to do the lookup.

I did clean it up to the following.

    if (ctx->rank == 0 && sync != ctx->sync) {
        ctx->sync = sync;
        if (!ctx->sync_txn_submitted) {
            if (ctx->sync_timer_started) {
                flux_watcher_stop (ctx->sync_w);
                if (ctx->sync > 0.0) {
                    flux_timer_watcher_reset (ctx->sync_w, ctx->sync, 0.);
                    flux_watcher_start (ctx->sync_w);
                }
                else /* ctx->sync == 0.0 */
                    ctx->sync_timer_started = false;
            }
            else {
                if (ctx->primary_commit_since_last_sync)
                    start_sync_timer (ctx); /* N.B. this was previously run_sync_timer() */
            }
        }
    }

char *name = NULL;

/* N.B. we "fake" prepend to the ready list by storing the sync
* transactino in a variable in the kvstxn_mgr_t. If we really
Member:

typo "transactino"


error:
kvstxn_destroy (kt);
free (name);
Member:

free() might clobber errno, and flux_log_error() is called when this function fails.

double sync; /* in seconds */
flux_watcher_t *sync_w;
bool sync_timer_started;
bool sync_txn_submitted;
Member:

It may add unnecessary complexity to have the sync_txn_submitted flag when you could just allow another sync to be submitted when one is in progress. It would do no harm, right?

Well I suppose if the sync time was very short it could go badly. Regardless, maybe it would be a good idea to have a lower bound on the configured sync time? Like say 1m?

Member Author:

ahh good catch, i don't allow two "sync prepends" to the kvstxn queue, so it literally would do nothing if a second were attempted to be added by the user.

@chu11 (Member, Author) commented Jul 14, 2022

oh wait ... this was there to catch a corner case. A "sync prepend" may occur while a commit is in progress, so the checkpoint would occur right after the current commit in progress is complete.

If we removed this flag, the sync timer could go off for the commit right before the checkpoint is done. I guess we could allow it? But it seems sort of silly to me to allow this. We'd want the sync timer to start for the commit right after the sync/checkpoint.

Member:

I'm having a hard time following this, and that's not a good indication that someone coming along later will be able to quickly understand it enough to make changes. I'm trying to think what would cordon this feature off from the rest of the complex kvs stuff, so it could be grokked in isolation.

What if we just create a new source module sync.c and each time we complete a transaction on the root namespace, we call a function like sync_setroot(). Then in that module (without hooking into any other part of the KVS except maybe a configuration callback) we implement the content flush + checkpoint and timer? Maybe sync.c could contain functions that are also used by the FLUX_KVS_SYNC flag handling in the commit code?

Too much of a do-over?

Member Author:

lemme try to split out into another file, see if that will make things clearer.

Member Author:

another thought: would it be worthwhile to go back to the earlier timer-based method? Simply wake up every X time and perform a sync using FLUX_KVS_SYNC? So much of the complexity is eliminated, but at the cost of an occasional timer wakeup and the setroot event that goes out every once in a while.

Member:

It seems wrong to broadcast the setroot event when the KVS content has not changed.

Doing this standalone really shouldn't be that bad - in fact, couldn't it just subscribe to the setroot event and use that to start the timer?

ctx->sync_timer_started = true;
}

static void start_sync_timer (struct kvs_ctx *ctx, struct kvsroot *root)
Member:

The names of these two functions are a bit misleading since neither of them "run" the timer. They both only "start" the timer, one conditionally. How about:

  • start_sync_timer()
  • start_sync_timer_if_needed()

or similar?

@chu11 (Member, Author) commented Jul 15, 2022

talked to @garlick, and we both agree that this has become more complicated than either of us initially imagined. So we're going to backtrack. We're going to go with the timer loop idea that I originally proposed, but we'll create a special internal flag like KVSTXN_NO_PUBLISH or something like that, so it will not publish the setroot event after a transaction has completed.

we'll come back to this someday if this ends up being a performance issue, which we both agree is unlikely if the user sticks to sensible checkpoint periods (like 30 minutes), but can be a problem if the period goes down (which is where a lot of the complicated corner cases were coming in).

@chu11 chu11 force-pushed the issue4301_defensive_checkpoints branch from 239a65b to 11ba858 Compare July 18, 2022 16:49
@chu11 (Member, Author) commented Jul 18, 2022

rebased & re-pushed

  • renamed "sync" config to "checkpoint-period"
  • went back to timer loop mechanism, wakeup once every X configured time
  • support a NO_PUBLISH flag in kvstxn to avoid sending out setroot events after an internal checkpoint is done.
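A rough sketch of what the timer-loop callback looks like under this design; the structure and field names follow the snippets quoted in the review below, but the kvstxn_mgr_add_transaction() call (its flags and the extra internal-flags argument) is abbreviated and may not match the final code exactly:

    static void checkpoint_period_cb (flux_reactor_t *r,
                                      flux_watcher_t *w,
                                      int revents,
                                      void *arg)
    {
        kvs_checkpoint_t *kcp = arg;
        char name[64];
        json_t *ops;

        /* nothing changed since the last checkpoint: this wakeup is a no-op */
        if (kcp->last_checkpoint_seq == kcp->root_primary->seq)
            return;

        snprintf (name, sizeof (name), "checkpoint-period.%u",
                  kcp->root_primary->seq);
        /* queue an empty transaction that flushes + checkpoints
         * (FLUX_KVS_SYNC) but does not publish a setroot event
         * (KVSTXN_INTERNAL_FLAG_NO_PUBLISH) */
        if (!(ops = json_array ())
            || kvstxn_mgr_add_transaction (kcp->root_primary->ktm,
                                           name,
                                           ops,
                                           FLUX_KVS_SYNC,
                                           KVSTXN_INTERNAL_FLAG_NO_PUBLISH) < 0)
            flux_log_error (kcp->h, "checkpoint-period transaction failed");
        json_decref (ops);
    }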

@@ -1239,6 +1245,60 @@ static void heartbeat_sync_cb (flux_future_t *f, void *arg)
flux_future_reset (f);
}

static void checkpoint_period_cb (flux_reactor_t *r,
Member:

Any reason this stuff was not moved to a new source file? It seems like it could be fairly self-contained and kvs.c is already really long.

Member Author:

it seemed just below the threshold for a new file, but it's close. I'll splice out into a new file.

@chu11 chu11 force-pushed the issue4301_defensive_checkpoints branch from 11ba858 to 6c64cb6 Compare July 22, 2022 13:55
@chu11 (Member, Author) commented Jul 22, 2022

re-pushed, splicing all the checkpoint-period stuff into its own file. There were two tricky parts:

  1. the need to update the "work queue" of roots with work in them. So I had to add a callback function to deal with that.

  2. how to deal with the fact that this only matters on rank == 0 without sticking a ton of if (ctx->rank == 0) checks everywhere.

I just made it so the checkpoint-period "struct" is set to NULL on rank != 0, and all of the checkpoint-period functions do nothing when the context pointer is NULL.
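A sketch of the NULL-guard convention being described; the function and field names here are illustrative, only the pattern matters:

    /* every kvs_checkpoint_* entry point is a no-op when kcp is NULL,
     * i.e. on any rank other than 0 where the object was never created */
    void kvs_checkpoint_config_set (kvs_checkpoint_t *kcp,
                                    double checkpoint_period)
    {
        if (!kcp)
            return;
        kcp->checkpoint_period = checkpoint_period;
    }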

@garlick (Member) left a comment

This looks great @chu11. I think breaking the functionality out to its own source file was a good move. At least for a relative outsider to the code base, it's way easier to grok when it can be isolated.

I made some suggestions but they are all pretty superficial at this point.

Oh I'll set this up on my test system and give it a quick "real life" test.

@@ -55,7 +55,8 @@ struct kvstxn {
json_t *ops;
json_t *keys;
json_t *names;
int flags;
int flags; /* kvs flags from request caller */
int api_flags; /* special kvstxn api flags */
Member:

I'm not sure the api_ prefix is communicating that it's an internal flag. Would internal_flags be better?

Member Author:

i wasn't the biggest fan of "api_flags" either. The issue was there is already an "internal_flags" used internally within the kvstxn api (internal means not externally exposed to users).

perhaps it could be called kvstxn_flags? or perhaps the other flags field needs to be renamed to op_flags? lemme think about it.

@chu11 (Member, Author) commented Jul 25, 2022

ahh, I had a good idea: the "internal_flags" didn't really have to be bitmasks, they were just there as "bools" for certain conditions. So I replaced them with a bunch of bools. Then api_flags could become "internal_flags" instead.

Comment on lines 155 to 161
/* ctx->kcp will not be created on rank != 0, all kvs_checkpoint
* functions will do nothing when ctx->kcp is NULL.
*/
ctx->kcp = kvs_checkpoint_create (h,
NULL, /* set later */
0.0, /* default 0.0, set later */
work_queue_check_append_wrapper,
ctx);
if (!ctx->kcp)
goto error;
Member:

Since this code is inside an if (ctx->rank == 0) block, maybe the comment is not needed here? Might be better to just document in kvs_checkpoint.h that each function is a no-op if called with a NULL kvs_checkpoint_t argument, to avoid excessive rank 0 tests in the code.

@@ -2703,6 +2727,30 @@ static void setroot_unpause_request_cb (flux_t *h, flux_msg_handler_t *mh,
flux_log_error (h, "%s: flux_respond_error", __FUNCTION__);
}

static void reload_cb (flux_t *h,
Member:

maybe naming this config_reload_cb() would be better in such a large code base as the kvs?

if (kcp->last_checkpoint_seq == kcp->root_primary->seq)
return;

if (asprintf (&name, "checkpoint-period.%u", kcp->root_primary->seq) < 0
Member:

It would probably be better to use a static buffer instead of asprintf() here since the string is of predictable length.
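For example, a sketch of the static-buffer form (64 bytes comfortably covers the "checkpoint-period." prefix plus a 32-bit sequence number):

    char name[64];
    (void)snprintf (name, sizeof (name), "checkpoint-period.%u",
                    kcp->root_primary->seq);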

Comment on lines 156 to 160
free (name);
json_decref (ops);
Member:

Clobbers errno

Member Author:

does it matter in this case? the function returns void.

Member:

Oh sorry, I thought I was following a different call stack. This is the timer watcher callback (duh).

Given that the function cannot return an error, maybe it would make sense to rename the goto label to "done".

kvs_checkpoint_txn_cb txn_cb,
void *txn_cb_arg);

/* will update internal checkpoint_period setting as needed */
Member:

I think you can drop "will" in all the comments in this header file.

Comment on lines 44 to 47
cat >kvs.toml <<EOF &&
[kvs]
checkpoint-period = "1Z"
EOF
Member:

Suggestion: use <<-EOF and then the here-doc content can be indented with tab characters.

@garlick (Member) commented Jul 25, 2022

I set my test system up for 30m checkpoints, put something in the kvs, then after an hour ran flux kvs dump --checkpoint - | tar tvf - and there was my test key. Yay!

@chu11 chu11 force-pushed the issue4301_defensive_checkpoints branch 2 times, most recently from bb2d01a to 0e32da4 Compare July 25, 2022 21:48
@chu11 (Member, Author) commented Jul 25, 2022

re-based and re-pushed with fixes per the comments above. Perhaps you want to give it one more skim @garlick? The big change is making the previously named internal_flags into three separate bools:

-    int internal_flags;
+    bool processing;            /* kvstxn is being processed */
+    bool merged;                /* kvstxn is a merger of transactions */
+    bool merge_component;       /* kvstxn is member of a merger */

and the previously named api_flags is now internal_flags.

@garlick (Member) left a comment

Looks good after a final skim!

chu11 added 7 commits July 25, 2022 15:59
Problem: The kvsroot.h header defines the kvsroot struct, which
includes a zlist_t member.  However, it does not include
the appropriate headers to ensure this type is defined.

Solution: Add include of czmq_containers.h into kvsroot.h.
Problem: In the kvstxn code, an internal bitmask of flags is used
for "internal" purposes.  The use of a bitmask for internal purposes
can be confused with other flag parameters that may be passed in
by users.

Solution: Remove the "internal_flags" and replace with bools representing
the conditions the internal code cares about.
Problem: In the future we'd like to add some special options
to transactions in the kvstxn API.  However, nothing in the
api currently supports this.

Solution: Add a new "internal_flags" parameter to kvstxn_mgr_add_transaction()
which will allow users to set special kvstxn flags.  Add a new
kvstxn_get_internal_flags() to extract those flags from a transaction.

Note that no flags are currently supported.
Problem: We would like to mark some transactions to indicate they
should not publish their changes after completion.

Solution: Add a new internal flag called KVSTXN_INTERNAL_FLAG_NO_PUBLISH.
If set, do not call setroot() or setroot_event_send().
Problem: While there are mechanisms to checkpoint the
primary KVS namespace via commands (such as the flux kvs
--sync option), there is no configurable way to do it at
timed intervals.

Solution: Add a "checkpoint-period" KVS configuration option.
The configuration option takes a fsd time length as input and
will checkpoint the present KVS root at each time interval.

Fixes flux-framework#4301
Problem: No documentation for the new [kvs] config options
exists.

Solution: Add a flux-config-kvs man page.
@chu11 chu11 force-pushed the issue4301_defensive_checkpoints branch from 0e32da4 to 882a320 Compare July 25, 2022 23:00
codecov bot commented Jul 25, 2022

Codecov Report

Merging #4383 (7e17fdb) into master (7e17fdb) will not change coverage.
The diff coverage is n/a.

❗ Current head 7e17fdb differs from pull request most recent head 882a320. Consider uploading reports for the commit 882a320 to get more accurate results

@@           Coverage Diff           @@
##           master    #4383   +/-   ##
=======================================
  Coverage   83.32%   83.32%           
=======================================
  Files         395      395           
  Lines       66652    66652           
=======================================
  Hits        55535    55535           
  Misses      11117    11117           

@mergify mergify bot merged commit 007b4a6 into flux-framework:master Jul 26, 2022
@chu11 chu11 deleted the issue4301_defensive_checkpoints branch July 26, 2022 04:17