-
Notifications
You must be signed in to change notification settings - Fork 67
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cloud NDB - suppressed failures to set and delete in memcache cause cache inconsistency #656
Comments
@justinkwaugh The reasoning here makes total sense, but I haven't been able to reproduce the issue IRL. Maybe your test case puts a much larger strain in memcache that what I'm doing to test with a live application. If you wanted to deploy the branch in #665 and test with your application, that would be superb. Thanks! |
Thanks for the update @chrisrossi. I will try to find time to do so, but I can't guarantee reproduceability. Like most transient issues, it's hard to reproduce, though it can obviously be simulated in a test case. Basically I have a set of regression tests (~500) that run in parallel using task queues in app engine and they update a single record in the datastore on completion which stores the number of pass/fail for that run, and very conveniently it would end with cache inconsistency for that record every so often (maybe 1 in 5 times). I've monkey patched in my own fix, but regardless I will definitely take a close look at your pull request as well. |
@chrisrossi I should also add that it's a little hard to test an individual change like this because there are so many issues which combine to cause inconsistency. #651 / #652 also contribute strongly to the problem, as they allow a reader thread to just do whatever it likes when combined with a failure to delete by the writer thread (which happens due to connection reset, or possibly a silent failure to delete, which I don't think can be detected even with noreply=False) |
Cloud NDB v 1.8.0, app engine standard python 3
Ok, after much wailing and gnashing of teeth, I finally got to the root of a cache inconsistency issue I was experiencing. It's transient, but happens regularly, and I finally pinned it down.
At a high level, the problem is that both the setting and the deleting of the lock value can fail silently (and will do so for the same key in the same put call more than you would think!). This completely invalidates any attempts that
_datastore_api.put()
is doing to maintain cache consistency, as the stale value is just left in memcache.pymemcache by default adds noreply to many commands -- in this case the problematic ones were set_many / delete_many. It is apparently not at all unlikely for a set or delete to fail, in which case pymemcache just silently suppresses the failure. If noreply=True is added to the call, then it will return a list of keys that failed from set_many, and I believe for delete_many it will raise an exception if any fail. Failing the delete silently is not a critical issue I suppose, though it will then prevent all readers from being able to set and thus subsequently get from memcache for that key until the lock expires, which takes 32s.
My suggestion would be to add
noreply=False
to the call toset_many()
and raise a typed exception in the Futures that failed to set. The exception type would be added to the tuple of transient errors in MemcacheCache, and as such, everything will just sort of work itself out. I would also addnoreply=False
to the call todelete_many()
, though I'm more unclear about its behavior in that case. I've not yet tried it so I do not know what exception it will throw on failure, but it could also be added to the transient errors tuple. The downside is that it does not tell you which key(s) failed so the exception will be thrown for all Futures.The text was updated successfully, but these errors were encountered: