kvs: add date to kvs-primary checkpoint #4136
hmmmm few builders failing with
not entirely sure how what i did could have mucked up these tests, and only on these builders. hmmm |
I like the idea of json here since we can add other metadata as it comes up without necessarily breaking backwards compatibility. However, breaking backwards compatibility today will require fluke and elmerfudd to blow away their content.sqlite file and start over. It might be good to make an effort to avoid that, for example if unpacking the JSON fails, fall back to the old method? It may also be a good idea to include a version number in the json object to make parsing multiple versions easier going forward. |
Although... if we were to have an eventlog in the KVS that's posted to on shutdown, right before the checkpoint is written, we could get the same effect and already have tooling for it. I was already thinking about something like that in conjunction with #4128. Can we pause and ponder?
Seems worthwhile to pause and think about. This was somewhat experimental. |
Thinking about this a bit more, this seems like a perfectly fine solution to #3580. The use of an eventlog for tracking intermediate checkpoints or whatever could be a totally separate thing handled at a higher level than this. (And in fact reading an eventlog within the kvs itself seems like a level violation of some kind). I think you should carry on but do add a version, and do support restoring from the current checkpoint format if that's not too hard. |
662e768
to
f80b1a5
Compare
re-pushed, addressing the comments made by @garlick above |
doh!
will try and redo test using python sqlite3 |
Seems like code footprint could be greatly reduced here.
@@ -2707,14 +2707,50 @@ static void process_args (struct kvs_ctx *ctx, int ac, char **av)
}
Commit message for f80b1a5:
Maybe should read "...when the primary namespace was checkpointed."
and "When checkpointing the primary namespace..."
src/modules/kvs/kvs.c
Outdated
@@ -2707,14 +2707,50 @@ static void process_args (struct kvs_ctx *ctx, int ac, char **av)
    }
}

static int checkpoint_get_version0 (flux_t *h, const char *value,
Drop unused flux_t * parameter.
Suggestion: don't bother with errno here since there's only ever one value.
src/modules/kvs/kvs.c
Outdated
    return 0;
}

static int checkpoint_get_version1 (flux_t *h, json_t *o,
Drop unused flux_t * parameter.
Suggestion: take string value not json_t * and do the json_loads() in this function.
Similarly, don't waste effort on errno here.
src/modules/kvs/kvs.c
Outdated
    if (!(o = json_loads (value, 0, NULL))) {
        errno = EINVAL;
        goto error;
    }
    strcpy (buf, value);
    flux_future_destroy (f);
With the above changes, this can just be
if (checkpoint_get_version1() < 0 && checkpoint_get_version0() < 0) {
    // error interpreting checkpoint object
}
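For illustration only, the "try version 1, then fall back to version 0" order suggested above can be sketched without a JSON library as a rough classification step. This is not the PR's code: `checkpoint_classify()` and `looks_like_blobref()` are hypothetical names, and the idea that a legacy checkpoint value is a raw blobref of the form `<hashname>-<hexdigits>` is an assumption of this sketch. A real implementation would presumably still decode the object with json_loads() rather than peek at the first character.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result codes for the two checkpoint formats discussed here. */
enum checkpoint_format {
    CHECKPOINT_INVALID = -1,
    CHECKPOINT_V0 = 0,      /* legacy: raw blobref string */
    CHECKPOINT_V1 = 1,      /* new: JSON object with version, rootref, timestamp */
};

/* Hypothetical helper: return 1 if 's' looks like "<hashname>-<hexdigits>",
 * 0 otherwise.  The exact blobref syntax is assumed, not taken from the PR.
 */
int looks_like_blobref (const char *s)
{
    const char *dash = strchr (s, '-');

    if (!dash || dash == s || dash[1] == '\0')
        return 0;
    for (const char *p = dash + 1; *p != '\0'; p++) {
        if (!isxdigit ((unsigned char)*p))
            return 0;
    }
    return 1;
}

/* Check for the new format first, then fall back to the legacy format. */
enum checkpoint_format checkpoint_classify (const char *value)
{
    if (!value)
        return CHECKPOINT_INVALID;
    while (isspace ((unsigned char)*value))
        value++;
    if (*value == '{')
        return CHECKPOINT_V1;
    if (looks_like_blobref (value))
        return CHECKPOINT_V0;
    return CHECKPOINT_INVALID;
}
```

The point of the ordering is that an existing content.sqlite holding only a rootref string would still be accepted, so nobody has to blow away their content store.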
src/modules/kvs/kvs.c
Outdated
static int get_timestamp_now (double *timestamp)
{
    struct timespec ts;
    if (clock_gettime (CLOCK_REALTIME, &ts) < 0)
        return -1;
    *timestamp = (1E-9 * ts.tv_nsec) + ts.tv_sec;
    return 0;
}
Could flux_reactor_now() be used instead of adding this function?
    json_decref (o);
    free (value);
Potentially clobbers errno.
json_decref() and free() potentially clobber errno.
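A minimal sketch of the usual fix, assuming nothing about the PR's final code: save errno before the cleanup calls and restore it afterwards, so the caller sees the error that caused the failure. `free_keep_errno()` and `parse_or_fail()` are hypothetical names used only for this illustration; the same pattern would apply to json_decref().

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: free 'ptr' without disturbing the caller's errno. */
void free_keep_errno (void *ptr)
{
    int saved_errno = errno;

    free (ptr);
    errno = saved_errno;
}

/* Hypothetical error path: errno is set to EINVAL, then cleanup runs.
 * Using the wrapper keeps EINVAL intact for the caller.
 */
int parse_or_fail (const char *input)
{
    char *copy = malloc (16);

    if (!copy)
        return -1;      /* malloc() has already set errno */
    if (!input) {
        errno = EINVAL;
        free_keep_errno (copy);
        return -1;
    }
    free (copy);
    return 0;
}
```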
src/modules/kvs/kvs.c
Outdated
    char datestr[128];
    time_t sec = timestamp;
    struct tm tm;
    if (timestamp > 0.) {
        gmtime_r (&sec, &tm);
        strftime (datestr, sizeof (datestr), "%FT%T", &tm);
    }
    else
        snprintf (datestr, sizeof (datestr), "N/A");
    flux_log (h, LOG_INFO,
              "restored kvs-primary from checkpoint on %s", datestr);
This should probably be in its own function, if not folded into checkpoint_get().
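One way the date formatting might look as its own function is sketched below, assuming only libc. `format_checkpoint_date()` is a hypothetical name, not a function from the PR. It keeps the behavior of the snippet above (UTC, "%FT%T", "N/A" when no timestamp is available) and additionally checks the return values of gmtime_r() and strftime().

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <time.h>

/* Hypothetical helper: write an ISO 8601 style UTC date for 'timestamp'
 * (seconds since the epoch) into 'buf', or "N/A" if timestamp <= 0,
 * which is how a legacy checkpoint without a timestamp is represented here.
 * Returns 0 on success, -1 if the date cannot be formatted or does not fit.
 */
int format_checkpoint_date (double timestamp, char *buf, size_t size)
{
    if (timestamp > 0.) {
        time_t sec = (time_t)timestamp;
        struct tm tm;

        if (!gmtime_r (&sec, &tm))
            return -1;
        if (strftime (buf, size, "%FT%T", &tm) == 0)
            return -1;
    }
    else {
        int n = snprintf (buf, size, "N/A");

        if (n < 0 || (size_t)n >= size)
            return -1;
    }
    return 0;
}
```

The caller would then only need a single flux_log() call with the resulting string.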
d0ae5b2
to
db3f6fa
Compare
re-pushed addressing all of @garlick's comments + fixing the I kept everything as fixups, lemme know if it would be preferred to just squash all the fixups. |
db3f6fa
to
3175ec6
Compare
Thanks! I think you can squash the fixups, then I'll give it another pass. |
3175ec6
to
3bee205
Compare
squashed all the fixups and re-pushed as an aside, I tried to generate the |
Sometimes it is easier to use a here-doc outside of any |
Looking good. Just noticed one minor errno clobber looks like it's still there.
Couple questions:
- where does the python sqlite3 module come from? Is there a dependency that configure.ac should be checking for? (I realize it wasn't introduced in this PR, so just wondering)
- I have another PR going right now that calls flux_kvs_getroot() then writes a checkpoint. I wonder if it would be a good idea to add some non-exported functions to libkvs for reuse within flux-core?
I could actually take your functions and turn them into shared ones in my PR if that helps keep things moving along.
    json_decref (o);
    free (value);
json_decref() and free() potentially clobber errno.
3bee205
to
48d89cd
Compare
just did a minor re-push, i realized |
Good question, I initially assumed we didn't have it installed, then to my surprise it's actually packaged with the base
I had pondered about this too, b/c we checkpoint the guest namespaces in
If you're thinking they'd be useful, perhaps some higher level functions would be worthwhile. Are you mostly thinking about functions that would create the checkpoint object for users? and presumably parse said checkpoint object? |
Problem: It'd be convenient if we knew the date when the primary namespace was checkpointed.

Solution: When checkpointing the primary namespace, store a json object with version, rootref, and timestamp, instead of just the rootref string. On retrieval, parse appropriately and retrieve timestamp for output in logs. Support the original checkpointing format by checking if the checkpoint object is a raw blobref string first.

Fixes flux-framework#3580
48d89cd
to
e1b0450
Compare
re-pushed, fixing up the potential |
What I have at the moment is a
Calling this from cron, for example, would let us recover more data in a crash. Also, if we append to an eventlog on startup and on exit, we can detect when an instance starts up that was not properly shut down. But we need to update the checkpoint after the startup append, otherwise the instance always reverts to the last valid shutdown. |
Codecov Report
@@ Coverage Diff @@
## master #4136 +/- ##
==========================================
- Coverage 83.33% 80.08% -3.26%
==========================================
Files 376 376
Lines 63037 62636 -401
==========================================
- Hits 52533 50162 -2371
- Misses 10504 12474 +1970
|
would it worthwhile for |
Well, I do think we want to keep the backends fairly "dumb" so any future changes happen in one place instead of three (or more). However, one nice change we could make would be to use a json "o" as the value, and allow it to be either a json string or a json object, so we don't have to encode the json object as a json string that we put inside a json object :-) |
Ahh that's a good idea. I was going to do a |
Maybe the next PR could introduce some reusable (but not public) functions in libkvs and make that protocol change? |
sounds good!, will make an issue after this PR goes in (not sure why the flux-sched build is taking forever) |
re-kicking the tests, the sched builder got stuck in git clone for some reason |
Thought I'd tackle this old issue. I did a relatively simple thing for the checkpointing: writing out a serialized json object storing the rootref + timestamp instead of just the rootref.
Multiple other possibilities, but thought this was the simplest / reasonable (i.e. no need to change content store implementation). Other options include writing two keys out instead of one.
Note that I intentionally did not support backwards compatibility. Figured use of this was rare at the moment.