improve KVS checkpoint protocol to allow for future changes #4149

Merged
merged 11 commits into flux-framework:master from issue4144_checkpoint_obj
Mar 2, 2022

chu11
Member

@chu11 chu11 commented Feb 19, 2022

Problem: We would like to checkpoint more complex json data structures to the content backing service, but since the content checkpoint services only take string values, it is inconvenient to have to encode/decode all json objects into/from strings.

Solution: Update the checkpoint protocol to instead take / send a json object. Support backwards compatibility by sending a json string when the data stored is not valid json. Update all callers accordingly and add additional tests.

Fixes #4144

Couple of side notes:

  • no new tests via s3 content backing, as I'm not entirely sure how to test at the moment. Perhaps there's some notes somewhere I'm not looking?

  • i pondered updating value to data everywhere, b/c value suggests just a string. But the sqlite database uses value as the column header, so I kept it as is.

@garlick
Member

garlick commented Feb 19, 2022

no new tests via s3 content backing, as I'm not entirely sure how to test at the moment. Perhaps there's some notes somewhere I'm not looking?

ci testing for it was added with #3025, although I'm having a bit of trouble unwinding how to run it manually. At least if you add tests to t0024-content-s3.t and push here, it will get run.

From raw logs of ci / bionic - gcc-8,content-s3,distcheck

2022-02-19T01:44:26.0389891Z         t0024-content-s3.t:  PASS: N=28  PASS=25  FAIL=0 SKIP=3 XPASS=0 XFAIL=0

Member

@garlick garlick left a comment

Couple of quick comments inline.

I think we probably only need to support "version 0" in content-sqlite, since I seriously doubt anybody has a running system instance using the other backing stores at this point.

Edit: in the other backing stores, you could leave a comment in the checkpoint_get() functions that recovery from a version 0 checkpoint is not supported.

Comment on lines 401 to 417
if (strlen (key) == 0) {
errno = EINVAL;
goto error;
}
if (!(value = json_dumps (o, JSON_ENCODE_ANY))) {
errno = EINVAL;
goto error;
Member

We really only need to support reading the version 0 checkpoint, not writing it, so JSON_ENCODE_ANY is not needed in the json_dumps() . While you're there, may as well add JSON_COMPACT. The strlen check can be dropped too.

Member Author

ahhh i remember why i did JSON_ENCODE_ANY before, it's b/c tons of tests like in t0012-content-sqlite still write general strings instead of objects. Do we still want to support generically writing strings to checkpoint? I assume no b/c this is specifically the kvs checkpoint put/get rpc.

Member

Ah I had forgotten about those. No, we should probably fix the tests to write valid objects.

@@ -353,18 +356,28 @@ void checkpoint_get_cb (flux_t *h,
errno = ENOENT;
goto error;
}
s = (char *)sqlite3_column_text (ctx->checkpt_get_stmt, 0);
if (!(o = json_loads (s, JSON_DECODE_ANY, NULL))) {
/* version 0 checkpoint may have been just a blobref string */
Member

s/may have been just/is/

Could also add a note that version 0 was used prior to flux-core 0.36?

Didn't think of this before, but instead of json_string(s) maybe we should just do

o = json_pack ("{s:s s:s s:f}", "version", 0, "value", "rootref", s, "timestamp", 0.);

That would simplify the code on the other end.

Member Author

@chu11 chu11 Feb 20, 2022

o = json_pack ("{s:s s:s s:f}", "version", 0, "value", "rootref", s, "timestamp", 0.);

I'm not following. Perhaps you messed up the json_pack? I think there's more args than the format can take (edit: and "version" is taking a string instead of int?)

Member

Ooops, right - delete "value". I mean why not construct a consistent "checkpoint object" for version 0 for outside users since it doesn't add any more complexity really.

Member Author

oh wait, i think i know what you meant ...

if (!(o = json_loads (...))) {
      /* if data is a version 0 blobref, respond with version 1 format */
      if (blobref_validate (s) == 0)
         o = json_pack ("{s:i s:s s:f}", "version", 0, "rootref", s, "timestamp", 0.);
}

Member

@garlick garlick Feb 22, 2022

The title of this PR might be better for the release notes if it read "enhance KVS checkpoint protocol."

@chu11 chu11 changed the title from "content-{sqlite,files,s3}: refactor checkpoint" to "content-{sqlite,files,s3}: enhance KVS checkpoint protocol" Feb 22, 2022
@chu11 chu11 force-pushed the issue4144_checkpoint_obj branch from 10497b9 to a0b5c86 Compare February 22, 2022 23:17
@chu11
Member Author

chu11 commented Feb 22, 2022

re-pushed, correcting things from above + more

  • fixed a memleak / possible corruption found by valgrind (should do s:O instead of s:o)
  • use JSON_COMPACT instead of JSON_ENCODE_ANY
    • and similarly no need to use JSON_DECODE_ANY
  • re-work checkpoint protocol, use o = json_pack ("{s:s ...}", "version", 0, ...) instead of sending raw string when supporting backwards compatibility.
  • update tests for changes
  • add some s3 tests (i hope they work in the CI, presently untested) (Edit: looks like it worked!)
  • remove version 0 backwards support in content-{files,s3}
  • remove some unnecessary tests

Edit: apologies if anyone started reviewing, i think i pushed an old branch to github at one point. Updated ~3:29pm

@chu11 chu11 force-pushed the issue4144_checkpoint_obj branch from a0b5c86 to 0659a7b Compare February 22, 2022 23:28
@garlick
Member

garlick commented Feb 24, 2022

Ooops I missed that this was ready for another review. Do you want to squash the fixups and I'll make another pass?

@chu11
Member Author

chu11 commented Feb 24, 2022

@garlick lemme squash and also update given #4153

@chu11 chu11 force-pushed the issue4144_checkpoint_obj branch from 99a334d to 0a96fd4 Compare February 24, 2022 21:49
@chu11
Member Author

chu11 commented Feb 24, 2022

re-pushed, squashed fixes and fixed up the new startlog code from #4153. Code in startlog.c is not the prettiest, figure will be fixed up when #4145 is done (which I'm currently working on)

@chu11 chu11 force-pushed the issue4144_checkpoint_obj branch 2 times, most recently from 206323b to 93503b3 Compare February 24, 2022 21:52
@garlick
Member

garlick commented Feb 24, 2022

Code in startlog.c is not the prettiest, figure will be fixed up when #4145 is done (which I'm currently working on)

Maybe it would be a good idea to combine this and that? It's an internal function so probably not vital that it appear standalone in release notes.

@chu11
Member Author

chu11 commented Feb 24, 2022

Maybe it would be a good idea to combine this and that? It's an internal function so probably not vital that it appear standalone in release notes.

Sure, we can do that!

@chu11 chu11 force-pushed the issue4144_checkpoint_obj branch from 93503b3 to 53e63b9 Compare February 25, 2022 21:46
@chu11
Member Author

chu11 commented Feb 25, 2022

re-pushed, adding the new kvs checkpoint helper functions in libkvs, and then updating usage in startlog and modules/kvs.

@chu11
Member Author

chu11 commented Feb 25, 2022

hmmm, hit a concerning build error on bionic

   expecting success: 
  	today=`date --iso-8601` &&
  	grep checkpoint dmesgfiles.out | grep ${today}
  
  not ok 17 - verify date in flux logs (files)

haven't seen this locally yet. will be adding some debug commits, perhaps date outputs something unexpected on this builder.

@chu11
Member Author

chu11 commented Feb 26, 2022

one builder also failed with this in t3200-instance-restart. I actually saw this a few times on a system under heavy load but couldn't reproduce it after a while, so I thought maybe the system was crazy busy for a short burst. But hmmm.

  expecting success: 
  	grep $(cat files_id1.out) files_list.out
  
  not ok 15 - inactive job list contains job from before restart

@chu11
Member Author

chu11 commented Feb 26, 2022

Finally got a good piece of debug info from a local run on fluke.

2022-02-26T01:39:38.377231Z kvs.err[0]: kvs_checkpoint_lookup_get_rootref: Invalid argument

then looked at content-files/kvs-primary

{"version":1,"rootref":"sha1-2b30390bf27cf7bba1222dc032797d8345b2669a","timestamp":1645839598.682987}}

huh? why is there an extra curly brace in there? That would explain the EINVAL. Some buffer overflow or non-NULL termination?

thought about it and came to the following theory: content-files originally created files named after the sha1 hash, so effectively all filenames are unique. So

    if ((fd = open (path, O_WRONLY | O_CREAT, 0666)) < 0)
        return -1;
    if (write_all (fd, data, size) < 0) {
        ERRNO_SAFE_WRAP (close, fd);
        return -1;
    }
    if (close (fd) < 0)
        return -1;

is ok. But for the kvs-primary key, the filename is always the same and the checkpoint can be re-written over and over again into the same file. This wasn't a problem when we wrote the sha1-hash rootref over and over again, since it's the same length every time.

With the new json object format, the object can be of variable length (b/c of the timestamp field), thus we're overwriting a prior entry, and if they are of different lengths ... badness. Thus the possibility of an invalid json object eventually being stored in the kvs-primary file.

So I think we need to add O_TRUNC to the open() above. Will try later this weekend.

Whew, this was a tough one. Hopefully this is correct :P

Edit: and this also explains the seeming randomness of the errors. It seemed "racy", but it's really down to dumb luck in how the timestamp is formatted in the stored kvs-primary file. If the second json object was the same length or longer, the test worked fine. Only when the object was shorter would there be issues.
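
To make the failure mode above concrete, here is a minimal standalone sketch in plain POSIX C. It is not the content-files code: `put_file`, `get_file`, and `write_all_sketch` are hypothetical helper names, and `extra_flags` exists only so the old and fixed behavior can be compared side by side.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Write all of data to fd, retrying on short writes. */
static int write_all_sketch (int fd, const char *data, size_t size)
{
    size_t done = 0;
    while (done < size) {
        ssize_t n = write (fd, data + done, size - done);
        if (n < 0)
            return -1;
        done += (size_t)n;
    }
    return 0;
}

/* Write data to path, creating the file if needed.
 * extra_flags = 0 mimics the old open() call (existing bytes are kept),
 * extra_flags = O_TRUNC mimics the fix (file is emptied first).
 */
int put_file (const char *path, const char *data, int extra_flags)
{
    int fd = open (path, O_WRONLY | O_CREAT | extra_flags, 0666);
    if (fd < 0)
        return -1;
    if (write_all_sketch (fd, data, strlen (data)) < 0) {
        close (fd);
        return -1;
    }
    return close (fd);
}

/* Read the whole file into buf and NUL terminate it.
 * Returns the number of bytes read, or -1 on error.
 */
ssize_t get_file (const char *path, char *buf, size_t bufsize)
{
    ssize_t n, total = 0;
    int fd;

    if (bufsize == 0 || (fd = open (path, O_RDONLY)) < 0)
        return -1;
    while ((n = read (fd, buf + total, bufsize - 1 - (size_t)total)) > 0)
        total += n;
    close (fd);
    if (n < 0)
        return -1;
    buf[total] = '\0';
    return total;
}
```

Writing "0123456789" and then "abc" with extra_flags set to 0 leaves "abc3456789" in the file, the same shape as the stray trailing brace seen in kvs-primary. Repeating the second write with O_TRUNC leaves just "abc".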

@garlick
Member

garlick commented Feb 26, 2022

Nice bit of sleuthing there!

@garlick garlick added this to the flux-core v0.36.0 milestone Feb 28, 2022
@chu11 chu11 force-pushed the issue4144_checkpoint_obj branch 2 times, most recently from f700c0b to 267dcbc Compare February 28, 2022 18:40
@codecov

codecov bot commented Feb 28, 2022

Codecov Report

Merging #4149 (c686cd4) into master (b857803) will decrease coverage by 0.00%.
The diff coverage is 72.81%.

❗ Current head c686cd4 differs from pull request most recent head 267dcbc. Consider uploading reports for the commit 267dcbc to get more accurate results

@@            Coverage Diff             @@
##           master    #4149      +/-   ##
==========================================
- Coverage   83.42%   83.41%   -0.01%     
==========================================
  Files         379      380       +1     
  Lines       63426    63471      +45     
==========================================
+ Hits        52910    52947      +37     
- Misses      10516    10524       +8     
Impacted Files Coverage Δ
src/modules/kvs/kvs.c 69.63% <66.66%> (-1.07%) ⬇️
src/modules/content-files/content-files.c 78.57% <71.42%> (-1.75%) ⬇️
src/common/libkvs/kvs_checkpoint.c 72.88% <72.88%> (ø)
src/modules/content-sqlite/content-sqlite.c 56.15% <78.57%> (-0.86%) ⬇️
src/cmd/builtin/startlog.c 89.47% <100.00%> (ø)
src/broker/state_machine.c 81.75% <0.00%> (-0.66%) ⬇️
src/shell/output.c 76.81% <0.00%> (-0.23%) ⬇️
... and 10 more

@chu11
Member Author

chu11 commented Feb 28, 2022

re-pushed, added a commit for adding O_TRUNC to the open() call and that seemed to fix things.

Before adding O_TRUNC, running make -j16 check 10 times on catalyst, I got two failures of not ok 15 - inactive job list contains job from before restart.

After adding O_TRUNC, running make -j16 check 25 times on catalyst, I saw no kvs checkpoint related errors (although I did see #4169 a few times).

@grondo
Contributor

grondo commented Feb 28, 2022

If this is close, don't forget to update the PR title for release notes (and I'll update release notes).

@garlick
Member

garlick commented Feb 28, 2022

Ah O_TRUNC looks like it was the right thing. Good job. OK, starting a review pass.

@chu11 chu11 changed the title from "content-{sqlite,files,s3}: enhance KVS checkpoint protocol" to "content-{sqlite,files,s3}: enhance KVS checkpoint protocol, add KVS checkpoint utility library" Feb 28, 2022
Member

@garlick garlick left a comment

I added a few comments - feel free to push back if you feel I'm off base.

Also: is there a way we could restructure the commits to avoid the history churn within this PR at the call points of the kvs_checkpoint*() helpers? For example, introduce the helpers first but implement the old protocol, then switch users to the helpers, then change protocol?

@@ -46,26 +46,45 @@ static void content_flush (flux_t *h)
static void kvs_checkpoint_put (flux_t *h, const char *treeobj)
{
Member

On the commit message:

content-{sqlite,files,s3}: refactor checkpoint

This goes beyond those three modules, so the prefix should probably be dropped. Also "refactor checkpoint" does not communicate much about what is changing. Maybe something like

improve KVS checkpoint protocol to allow for future changes

Problem: the kvs-checkpoint.get and kvs-checkpoint.put methods provided
by content back end modules operate on (key, blobref) tuples, but we would like
to store other metadata with checkpoints, such as a date, and not have to
change the database schema in every back end whenever the value format changes.

Solution: change the stored value to a json object that contains a blobref, a date,
and a version=1. The back ends just passively shuttle the json object without
interpretation, so they should not need to be changed for updates going forward.
There is one exception this time: to avoid losing data on system instance installations
when rolling out this change, content-sqlite translates the old format to a json object
with version=0.

Update users of the kvs-checkpoint methods and add tests.
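
For illustration only, the version 1 object described in that commit message could be laid out as in the dependency-free sketch below. It uses snprintf() as a stand-in for the jansson calls used in the PR, `checkpoint_encode_v1` is a hypothetical name, and the rootref is assumed to need no JSON escaping (which holds for "sha1-<hex>" style blobrefs).

```c
#include <stdio.h>

/* Encode a version 1 checkpoint object into buf.
 * Hypothetical stand-in for building the object with a JSON library.
 * Assumes rootref contains no characters that need JSON escaping.
 * N.B. a whole-number timestamp would be printed without a fractional
 * part by this sketch; a real encoder should keep it a JSON real.
 * Returns the encoded length, or -1 if buf is too small.
 */
int checkpoint_encode_v1 (char *buf,
                          size_t len,
                          const char *rootref,
                          double timestamp)
{
    int n = snprintf (buf,
                      len,
                      "{\"version\":1,\"rootref\":\"%s\",\"timestamp\":%.17g}",
                      rootref,
                      timestamp);
    if (n < 0 || (size_t)n >= len)
        return -1;
    return n;
}
```

Since the timestamp is printed with only as many digits as it needs, two checkpoints with the same rootref can encode to different lengths: a timestamp of 1645839598.25 produces an object one byte longer than 1645839598.5. The back end only ever sees the encoded string, so it should not assume a fixed size.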

Comment on lines 60 to 64
/* version 0 checkpoint
* - blobref string only
* version 1 checkpoint object
* - {"version":1 "rootref":s "timestamp":f}
*/
Member

Drop comment that's removed by a later commit.

Comment on lines 69 to 70
errno = ENOMEM;
log_err_exit ("Error encoding checkpoint object");
Member

Since the ENOMEM is just a guess anyway, suggest skipping it and just calling log_msg_exit().

Comment on lines 168 to 176
if (size > 0) {
/* recovery from version 0 checkpoint blobref not supported */
if (!(o = json_loads (data, 0, NULL))) {
errno = EINVAL;
goto error;
}
}
else
o = json_null ();
Member

I'm not thinking of a use case for translating empty value to json_null() so maybe that check is not needed and the empty string should just be passed to json_loads(), which will then fail.

Speaking of, we have a size for the data, so it might be good to use json_loadb() instead.

Also, might as well pass it a json_error_t and then set errstr to error.text before the goto error.

Comment on lines 215 to 217
if (!(value = json_dumps (o, JSON_COMPACT))) {
errno = EINVAL;
goto error;
}
Member

Might want to set errstr = "failed to encode checkpoint value"; here.

Comment on lines 406 to 418
if (!(value = json_dumps (o, JSON_COMPACT))) {
errno = EINVAL;
goto error;
}
Member

Might want to set errstr = "failed to encode checkpoint value"; here.

Comment on lines 2710 to 2712
/* Synchronously get checkpoint databy key from checkpoint service.
* Copy rootref buf with '\0' termination.
Member

s/databy/data by/

@@ -0,0 +1,33 @@
/************************************************************\
Member

Since there isn't a conflicting public version of this header like say message.h in libflux, maybe this can just be called kvs_checkpoint.h?

Comment on lines 23 to 25
flux_future_t *kvs_checkpoint_update (flux_t *h,
const char *key,
const char *rootref)
Member

Would kvs_checkpoint_commit() sound slightly better? I dunno, just a thought.

@@ -87,7 +87,7 @@ int filedb_put (const char *dbpath,
*errstr = "key name too long for internal buffer";
return -1;
}
-if ((fd = open (path, O_WRONLY | O_CREAT, 0666)) < 0)
+if ((fd = open (path, O_WRONLY | O_CREAT | O_TRUNC, 0666)) < 0)
Member

Good find, this is just a straight up bug (and my bad).

@chu11
Member Author

chu11 commented Mar 1, 2022

Also: is there a way we could restructure the commits to avoid the
history churn within this PR at the call points of the
kvs_checkpoint*() helpers? For example, introduce the helpers first
but implement the old protocol, then switch users to the helpers, then
change protocol?

Hmmm, not too easily, as I'd be changing a lot of code, sort of starting from scratch. Unfortunately I started this before your startlog PR, so doing it the other way around wasn't as obvious at the time.

@chu11
Member Author

chu11 commented Mar 1, 2022

Re-pushed addressing all of the comments above. Went ahead and squashed since everything is tiny changes.

@chu11 chu11 changed the title from "content-{sqlite,files,s3}: enhance KVS checkpoint protocol, add KVS checkpoint utility library" to "improve KVS checkpoint protocol to allow for future changes" Mar 1, 2022
Member

@garlick garlick left a comment

Thanks for all the changes, looking good!
I just had a couple more suggestions for cleanup, but they're fairly trivial so I'll tentatively mark approved.

Comment on lines 39 to 43
/* version 0 checkpoint
* - rootref string only
* version 1 checkpoint object
* - {"version":1 "rootref":s "timestamp":f}
*/
Member

Since the commit function only ever writes version 1, comment can be deleted.
(Anyway, from here, the version 0 checkpoint is also an object, just missing timestamp)

Comment on lines 44 to 61
if (!(o = json_pack ("{s:i s:s s:f}",
"version", 1,
"rootref", rootref,
"timestamp", timestamp))) {
errno = ENOMEM;
goto error;
}

if (!(f = flux_rpc_pack (h,
"kvs-checkpoint.put",
0,
0,
"{s:s s:O}",
"key",
key,
"value",
o)))
goto error;
Member

No need for the separate json_pack(). Just build the whole payload in flux_rpc_pack().

Member Author

oh duh, that's obvious now. I think I could do in a few other places too.

if (flux_rpc_get_unpack (f, "{s:o}", "value", &o) < 0)
return -1;

/* N.B. no need to check version, all versions support rootref */
Member

@garlick garlick Mar 1, 2022

Maybe we should actually check the version here (and generally, that the checkpoint object is well formed), since the back ends are only checking that there is a valid json object?

Also: combine unpacks (see comment in get_formatted_timestamp())

Comment on lines 124 to 132
if (flux_rpc_get_unpack (f, "{s:o}", "value", &o) < 0)
return -1;

if (json_unpack (o, "{s:i}", "version", &version) < 0) {
errno = EINVAL;
return -1;
}

if (version == 0) {
snprintf (buf, len, "N/A");
}
else if (version == 1) {
double timestamp;
time_t sec;
struct tm tm;

if (json_unpack (o, "{s:f}", "timestamp", &timestamp) < 0) {
errno = EINVAL;
return -1;
}
sec = timestamp;
gmtime_r (&sec, &tm);
strftime (buf, len, "%FT%T", &tm);
}
Member

Suggestion: combine the three unpacks into one, e.g.

if (flux_rpc_get_unpack (f, "{s:{s:i s:s s?f}}",  ...) < 0)
    return -1;
if (version != 0 && version != 1) {
    errno = EINVAL;
    return -1;
}

@chu11 chu11 force-pushed the issue4144_checkpoint_obj branch from 400d7da to 34eaf92 Compare March 1, 2022 18:41
@chu11
Member Author

chu11 commented Mar 1, 2022

re-pushed addressing the comments above. @garlick wanna do one more quick skim before I set MWP?

@chu11
Member Author

chu11 commented Mar 1, 2022

hmmm, hit one build error that i'm 99% sure is not related to this PR. restarted builders.

  expecting success: 
          ${FLUX_BUILD_DIR}/t/rexec/rexec -r 1 sleep 100 &
          pid1=$!
          ${FLUX_BUILD_DIR}/t/rexec/rexec -r 1 sleep 100 &
          pid2=$!
  	sleep 1 &&
          ${FLUX_BUILD_DIR}/t/rexec/rexec_ps -r 1 > output &&
          count=`cat output | wc -l` &&
          test "$count" = "2" &&
  	sleep 1 &&
  	kill -TERM $pid1 &&
  	kill -TERM $pid2 &&
          ${FLUX_BUILD_DIR}/t/rexec/rexec_ps -r 1 > output &&
          count=`cat output | wc -l` &&
          test "$count" = "0"
  
  not ok 27 - disconnect terminates all running processes

@grondo
Contributor

grondo commented Mar 1, 2022

We've been seeing that one lately: #4097

Member

@garlick garlick left a comment

Looks good! This time just a couple of minor things. If ci isn't complaining, no biggie I guess.

int version;
double timestamp;

if (!f || !buf || len <= 0) {
Member

I don't think len can be less than zero as size_t is unsigned.

Member

Oh, actually, probably the right thing is to ignore len since zero isn't the only value that's too short. Instead, check the return value of snprintf() and strftime() to catch any overflow.
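
A sketch of what checking those return values might look like, assuming a hypothetical standalone helper `format_checkpoint_time` that is handed the already-unpacked version and timestamp. This is not the libkvs function itself, just the overflow handling in isolation.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdio.h>
#include <time.h>

/* Format a checkpoint timestamp into buf as UTC "YYYY-MM-DDTHH:MM:SS".
 * Version 0 checkpoints carry no timestamp, so "N/A" is written instead.
 * Returns 0 on success, or -1 with errno = EINVAL on a NULL buffer,
 * an unknown version, or a buffer too small to hold the result.
 */
int format_checkpoint_time (int version,
                            double timestamp,
                            char *buf,
                            size_t len)
{
    if (!buf || (version != 0 && version != 1)) {
        errno = EINVAL;
        return -1;
    }
    if (version == 1) {
        time_t sec = (time_t)timestamp;
        struct tm tm;

        /* strftime() returns 0 when the result does not fit in len bytes */
        if (!gmtime_r (&sec, &tm)
            || strftime (buf, len, "%Y-%m-%dT%H:%M:%S", &tm) == 0) {
            errno = EINVAL;
            return -1;
        }
    }
    else {
        /* snprintf() returns the length it wanted to write, so a value
         * >= len means the output was truncated */
        int n = snprintf (buf, len, "N/A");
        if (n < 0 || (size_t)n >= len) {
            errno = EINVAL;
            return -1;
        }
    }
    return 0;
}
```

With this shape there is no separate check on len: a length of 0, or any other length that is too short, falls out of the strftime() and snprintf() return values.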

size_t len)
{
int version;
double timestamp;
Member

I would initialize timestamp to 0. for good measure since it's conditionally assigned. It's also conditionally used, but matching those two conditions up might be too much to ask of some compilers.

@chu11 chu11 force-pushed the issue4144_checkpoint_obj branch from 34eaf92 to 9ac82ef Compare March 1, 2022 21:22
@chu11
Member Author

chu11 commented Mar 1, 2022

just re-pushed, just cleaned up the two things @garlick noted above

diff --git a/src/common/libkvs/kvs_checkpoint.c b/src/common/libkvs/kvs_checkpoint.c
index abc7567..78e640d 100644
--- a/src/common/libkvs/kvs_checkpoint.c
+++ b/src/common/libkvs/kvs_checkpoint.c
@@ -97,9 +97,9 @@ int kvs_checkpoint_lookup_get_formatted_timestamp (flux_future_t *f,
                                                    size_t len)
 {
     int version;
-    double timestamp;
+    double timestamp = 0.;
 
-    if (!f || !buf || len <= 0) {
+    if (!f || !buf) {
         errno = EINVAL;
         return -1;
     }
@@ -119,11 +119,17 @@ int kvs_checkpoint_lookup_get_formatted_timestamp (flux_future_t *f,
         time_t sec = timestamp;
         struct tm tm;
         gmtime_r (&sec, &tm);
-        strftime (buf, len, "%FT%T", &tm);
+        if (strftime (buf, len, "%FT%T", &tm) == 0) {
+            errno = EINVAL;
+            return -1;
+        }
+    }
+    else { /* version == 0 */
+        if (snprintf (buf, len, "N/A") >= len) {
+            errno = EINVAL;
+            return -1;
+        }
     }
-    else /* version == 0 */
-        snprintf (buf, len, "N/A");
-
     return 0;
 }

@garlick
Member

garlick commented Mar 1, 2022

Ready for MWP on this one?

@chu11
Member Author

chu11 commented Mar 1, 2022

yup! thanks for review.

chu11 added 11 commits March 1, 2022 22:42

Problem: When calling flux_respond_pack(), an unnecessary parameter
is passed to the variable argument function.

Solution: Remove the unnecessary argument.

Problem: In the t0018-content-files.t and t0024-content-s3.t tests,
a test output file was used twice, thus overwriting the output from
a prior test.

Solution: Rename the filenames to be unique.

Problem: Tests in t2010-kvs-snapshot-restore.t largely duplicated
tests in t0012-content-sqlite.t.

Remove the duplicate tests and test files.

Problem: The kvs-checkpoint.put callback checks if the key input by the
user is non-empty.  This check is unnecessary.

Remove the check.

Problem: the kvs-checkpoint.get and kvs-checkpoint.put methods provided
by content back end modules operate on (key, blobref) tuples, but we would like
to store other metadata with checkpoints, such as a date, and not have to
change the database schema in every back end whenever the value format changes.

Solution: change the stored value to a json object that contains a blobref, a
date, and a version=1. The back ends just passively shuttle the json object
without interpretation, so they should not need to be changed for updates going
forward.  There is one exception this time: to avoid losing data on system
instance installations when rolling out this change, content-sqlite translates
the old format to a json object with version=0.

Update users of the kvs-checkpoint methods and add tests.

Fixes flux-framework#4144

Problem: Some kvs checkpointing operations are duplicated.  It
would be convenient if there were common functions for it.

Solution: Add common checkpointing operations into a new
kvs_checkpoint api in libkvs.  Keep the API private for now.
Add unit tests.

Fixes flux-framework#4145

Problem: startlog code can be simplified by using kvs checkpointing
helper functions.

Solution: Use kvs_checkpoint_update() to simplify kvs checkpoint code.

Problem: kvs code can be simplified by using kvs checkpointing
helper functions.

Solution: Use kvs checkpointing helper functions when handling
kvs primary checkpoint.

Problem: In content-files, most files are created using a hash
of the data.  So in most cases the filenames are unique.  However,
the kvs-checkpoint.put rpc will typically write to the same file
over and over again.  In the event the file already exists, content-files
did not clear the data in a file before writing.  This can lead to
corrupted data if the user wrote fewer bytes to the file than before.

Solution: Call open() with O_TRUNC to truncate the file before writing
new data.  Add tests increasing / decreasing checkpoint data size.

@jameshcorbett jameshcorbett force-pushed the issue4144_checkpoint_obj branch from 9ac82ef to 43bc1db Compare March 1, 2022 22:42
@chu11
Member Author

chu11 commented Mar 1, 2022

hmm, builder failed with

ERROR: t2402-job-exec-dummy.t - exited with status 141 (terminated by signal 13?)

All tests passed in t2402-job-exec-dummy, so this is on shutdown. 13 = SIGPIPE, which I'm not entirely sure how that could happen. Perhaps something slow going on during shutdown? Restart builder.
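
For reference, the 141 to 13 arithmetic: shells conventionally report a process killed by signal N as exit status 128 + N. A tiny sketch of that decoding, with hypothetical helper names:

```c
#include <signal.h>

/* Decode a shell-style exit status.
 * Returns the signal number if status looks like 128 + N,
 * or 0 if it does not look like a death by signal.
 */
int exit_status_to_signal (int status)
{
    if (status > 128 && status < 256)
        return status - 128;
    return 0;
}

/* Returns 1 if the exit status corresponds to SIGPIPE, else 0. */
int died_of_sigpipe (int status)
{
    return exit_status_to_signal (status) == SIGPIPE;
}
```

This only explains how the number is derived, not why the test script received the signal during shutdown.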

@chu11
Member Author

chu11 commented Mar 2, 2022

@Mergifyio refresh

@mergify
Contributor

mergify bot commented Mar 2, 2022

refresh

✅ Pull request refreshed

@mergify mergify bot merged commit 0bab06e into flux-framework:master Mar 2, 2022
@chu11 chu11 deleted the issue4144_checkpoint_obj branch March 2, 2022 19:00
Development

Successfully merging this pull request may close these issues.

content-sqlite / content-files : checkpoint functions should put/get object instead of string
3 participants