
kvs commit handling should propagate content cache errors back to user #792

Closed
garlick opened this issue Sep 2, 2016 · 8 comments

garlick commented Sep 2, 2016

If an error occurs while making an RPC to the content-cache service, it is not handled properly in the following cases:

  • store() handles synchronous errors, but async errors are only logged, not propagated to the caller
  • load() errors (sync and async) are only logged, not propagated to the caller

Further, after an async cache fill or cache flush error occurs, the cache entry can be left dirty or invalid forever.


garlick commented Sep 2, 2016

The description of store() above assumes #788 gets merged.


garlick commented Sep 12, 2016

With PR #801 it becomes possible to put a dangling blobref into the KVS. In addition to handling the errors above, the commit logic should ensure that any new references are loaded into the rank 0 cache to verify that they resolve. Alternatively, we might add a content store operation that simply checks whether a blobref has been stored without returning its data, since one use case for creating such a reference outside the KVS is to avoid running a large value through the KVS commit logic.


chu11 commented Aug 29, 2017

Update here given all the recent changes to the KVS:

store() handles synchronous errors, but async errors are only logged, not propagated to the caller

Still true, but the relevant code is now in the area of content_store_get().

load() errors (sync and async) are only logged, not propagated to the caller

Synchronous errors are handled now.

Further, after an async cache fill or cache flush error occurs, the cache entry can be left dirty or invalid forever.

I believe these error paths have now been handled properly.


chu11 commented Oct 17, 2017

Per discussion in #1227, a good test to add once this is complete is a valref containing multiple blobs, one of which is an illegal blobref.


garlick commented Apr 24, 2018

Do we still have any content errors that are not handled properly? I'm going to tentatively close this as it seems like we've handled them all. @chu11 please reopen if I'm wrong.

@garlick garlick closed this as completed Apr 24, 2018

chu11 commented Apr 24, 2018

Yeah, the asynchronous cases still exist. Basically, the replay of a stalled request is only triggered when a value is updated in the local KVS cache. If an error occurs such that the cache is never updated, the replay of the original request will never be triggered, so it can never complete.


chu11 commented Nov 9, 2018

As noted in #1799, this issue is not limited to "content cache" errors. If a blobref happens to be missing, that is equivalent to a "content cache" error.

With the completion of #1696 / #1318, I think an easy way to fix this is via msg aux_set/aux_get. If an error occurs in asynchronous communications (such as in content_store_get or content_load_completion), use msg aux_set to set some error field of some sort, then on replay check if that error field has been set.

I'll need some new function in the cache API to iterate through waitqueue entries and set the msg aux field, but this seems like a not so horrible path to finally fix this.


chu11 commented Nov 9, 2018

doh, the above assumed that all waiters had messages, but that's not always the case. What I really need is to set some type of "error callback" on a wait_t so that when the waiter is replayed it knows to call that callback and do something (call lookup_set_aux_errnum()) before a replay occurs.

chu11 added a commit to chu11/flux-core that referenced this issue Nov 13, 2018
Allow many errors during an asynchronous store to be returned
to the original caller, instead of just hanging.  A few catastrophic
errors cannot be recovered from.

Fixes flux-framework#792