
kvs commit handling should propagate content cache errors back to user #792

Closed
garlick opened this issue Sep 2, 2016 · 8 comments

garlick commented Sep 2, 2016

If an error occurs while making an RPC to the content-cache service, it is not handled properly in the following cases:

  • store() handles synchronous errors, but async errors are only logged, not propagated to the caller
  • load() errors (sync and async) are only logged, not propagated to the caller

Further, after an async cache fill or cache flush error occurs, the cache entry can be left dirty or invalid forever.


garlick commented Sep 2, 2016

The description of store() above assumes #788 gets merged.


garlick commented Sep 12, 2016

With PR #801 it becomes possible to put a dangling blobref into the KVS. In addition to handling the errors above, the commit logic should ensure that any new references are loaded into the rank 0 cache to verify that they resolve. Alternatively, we might add a content store operation that simply checks whether a blobref has been stored without returning its data, since one use case for creating such a reference outside the KVS is to avoid running a large value through the KVS commit logic.


chu11 commented Aug 29, 2017

Update here given all the recent changes to the KVS:

store() handles synchronous errors, but async errors are only logged, not propagated to the caller

Still true, but the relevant code is now in the area of content_store_get().

load() errors (sync and async) are only logged, not propagated to the caller

Synchronous errors are handled now.

Further, after an async cache fill or cache flush error occurs, the cache entry can be left dirty or invalid forever.

I believe these error paths have now been handled properly.


chu11 commented Oct 17, 2017

Per discussion in #1227, a good test to add once this is complete is a valref containing multiple blobs, one of which is an illegal blobref.


garlick commented Apr 24, 2018

Do we still have any content errors that are not handled properly? I'm going to tentatively close this as it seems like we've handled them all. @chu11 please reopen if I'm wrong.

@garlick garlick closed this as completed Apr 24, 2018

chu11 commented Apr 24, 2018

Yeah, the asynchronous cases still exist. Basically, the replay of a stalled request is only triggered when a value is updated in the local KVS cache. If an error occurs such that the cache is never updated, the replay of the original request will never be triggered, so it can never complete.


chu11 commented Nov 9, 2018

As noted in #1799, this issue is not limited to "content cache" errors. If a blobref happens to be missing, that is equivalent to a "content cache" error.

With the completion of #1696 / #1318, I think an easy way to fix this is via msg aux_set/aux_get. If an error occurs in asynchronous communications (such as in content_store_get or content_load_completion), use msg aux_set to set some error field of some sort, then on replay check if that error field has been set.

I'll need some new function in the cache API to iterate through waitqueue entries and set the msg aux field, but this seems like a not so horrible path to finally fix this.


chu11 commented Nov 9, 2018

doh, the above assumed that all waiters had messages, but that's not always the case. What I really need is to set some type of "error callback" on a wait_t so that when the waiter is replayed it knows to call that callback and do something (call lookup_set_aux_errnum()) before a replay occurs.

chu11 added a commit to chu11/flux-core that referenced this issue Nov 13, 2018
Allow many errors during an asynchronous store to be returned
to the original caller, instead of just hanging.  A few catastrophic
errors cannot be recovered from.

Fixes flux-framework#792