
fix: prevent overwrite of cache lock value #667

Merged
merged 14 commits on Jul 20, 2021

Conversation


@chrisrossi chrisrossi commented Jun 2, 2021

Fixes #651 #652 #653

@chrisrossi chrisrossi requested a review from andrewsg as a code owner June 2, 2021 20:58
@chrisrossi chrisrossi requested a review from a team June 2, 2021 20:58
@chrisrossi chrisrossi requested a review from a team as a code owner June 2, 2021 20:58
@product-auto-label product-auto-label bot added the api: datastore Issues related to the googleapis/python-ndb API. label Jun 2, 2021
@google-cla google-cla bot added the cla: yes This human has signed the Contributor License Agreement. label Jun 2, 2021
@chrisrossi
Contributor Author

chrisrossi commented Jun 2, 2021

Note that this is currently based on PR #665 (the fix-656 branch).

@chrisrossi
Contributor Author

The spelling is a little different and it's less memcached-specific, but this is more or less the solution proposed in #651.

@@ -146,8 +146,15 @@ def lookup(key, options):
            entity_pb.MergeFromString(result)

        elif use_datastore:
-           yield _cache.global_lock(cache_key, read=True)
-           yield _cache.global_watch(cache_key)
+           lock_acquired = yield _cache.global_lock(cache_key, read=True)


Transient errors that can't be resolved and result in an exception should probably just be caught here, with key_locked set to True as well; there's no need to bubble them up and fail things.

@justinkwaugh justinkwaugh Jun 3, 2021

In writing this comment I just remembered that handle_transient_errors doesn't raise for read operations, which is the source of the problem described in #652; once that is addressed, a try/except would be wanted here as described above. I had implemented a parameter to handle_transient_errors that forces it to raise in all cases.

Contributor Author

ebb2a1c

I believe this also addresses #652

@justinkwaugh justinkwaugh Jun 8, 2021

Agreed, because it will return None. However, I do think a try/except is appropriate for non-transient errors. A cache failure, even a catastrophic one, should not prevent a read thread from going to the database for the value; it should just prevent it from writing back to the cache.
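
For illustration, here is a self-contained sketch of that principle (the cache/datastore interface below is hypothetical, not the ndb code): any cache failure during lookup degrades to a plain Datastore read instead of raising, and only suppresses the cache write-back.

def lookup(key, cache, datastore):
    key_locked = True  # pessimistic default: assume we may not touch the cache
    try:
        # Hypothetical call: returns True if the read lock was acquired.
        key_locked = not cache.lock_for_read(key)
    except Exception:
        pass  # even a catastrophic cache error just means "don't cache"

    value = datastore.get(key)  # the read itself never depends on the cache
    if not key_locked:
        try:
            cache.write_back(key, value)
        except Exception:
            pass  # a failed write-back is not fatal either
    return value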

result = cache_call.result()
if result:
    for key, future in self.futures.items():
        key_result = result.get(key, None)


Should the default value be None or False in the case of set_if_not_exists?

Contributor Author

I don't think so. In the case of set_if_not_exists, there will be a result for the key, so the default won't be used.

@justinkwaugh justinkwaugh Jun 4, 2021

To me it's more of an API contract thing. Both setnx and memcache.add return true if the value was set and false if it was not. So what does None mean? If it is just to be considered falsey, then it is probably OK.
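
For reference, redis-py's SET with nx=True already has a similar truthy/falsey contract: True when the value was stored, None when the key already exists. A tiny sketch, assuming a local Redis:

import redis

client = redis.StrictRedis()
client.delete(b"lock-key")

assert client.set(b"lock-key", b"lock", nx=True)            # stored -> truthy (True)
assert client.set(b"lock-key", b"other", nx=True) is None   # already set -> falsey (None)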

Contributor Author

This code is also used by set, not just set_if_not_exists. In the case of set_if_not_exists, it will always follow the path that eventually returns a boolean value; None won't ever be returned for set_if_not_exists.

        future.set_result(key_result)
else:
    for future in self.futures.values():
        future.set_result(None)


Same question here... None or False for set_if_not_exists

Contributor Author

I don't think so. In the case of set_if_not_exists, there will be a result.

@chrisrossi
Contributor Author

I think the situation is much improved, but I also think there's still a way for things to get out of sync. Namely, in _datastore_api.lookup, if a key is updated by another thread between the call to _cache.global_lock and _cache.global_watch, we could end up watching a value that we didn't set. If my reasoning is off here, let me know.

I think to solve this, we can probably modify the watch command to take the value we just set as an expected value, so we can make sure that when we watch it, the key is still set to that value. I think it may also be necessary to change the lock value to be unique per thread (using uuid, or similar) so that threads can distinguish their own lock from a lock from another thread.
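
For illustration, a minimal sketch of that idea against redis-py (assumes a local Redis; the uuid-based lock value and the final write-back are made up for the example, not the eventual implementation):

import uuid
import redis

client = redis.StrictRedis()
key = b"cache-key"
client.delete(key)

my_lock = b"lock-" + uuid.uuid4().hex.encode()  # unique per thread
client.set(key, my_lock, nx=True, ex=32)        # take the read lock

# ... read the entity from Datastore here ...
entity = b"serialized entity"

# Only watch (and later write back) if the key still holds *our* lock value;
# if another thread replaced it in the meantime, skip the write-back entirely.
pipe = client.pipeline()
try:
    pipe.watch(key)
    if pipe.get(key) == my_lock:
        pipe.multi()
        pipe.set(key, entity)
        pipe.execute()  # raises WatchError if the key changed after WATCH
except redis.exceptions.WatchError:
    pass  # lost the race; leave the cache alone
finally:
    pipe.reset()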

@justinkwaugh

justinkwaugh commented Jun 8, 2021

Yes, and changing the way the lock value works would align with the needs of #653. In reality I don't think it would matter if two read threads conflict in the way you described, as it is a race condition that either should be allowed to win; but should a write thread acquire a lock, no read should ever watch it. And for write threads it is OK for multiple to lock, as long as the key is not considered unlocked until the last write thread finishes.

@chrisrossi
Contributor Author

Yes, and changing the way the lock value works would align with the needs of #653. In reality I don't think it would matter if two read threads conflict in the way you described, as it is a race condition that either should be allowed to win; but should a write thread acquire a lock, no read should ever watch it.

Exactly. And it looks like it would address #653 as well.

@chrisrossi
Contributor Author

Possibly instead of a unique lock value per thread, we could just have two lock values, one for read and one for write.

@justinkwaugh

justinkwaugh commented Jun 8, 2021

Possibly instead of a unique lock value per thread, we could just have two lock values, one for read and one for write.

The challenge is just whether or not you want write locks to be re-entrant (by multiple threads). If so, they need to act like a semaphore to ensure that you only "unlock" once all concurrent write threads have finished.
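
One way to sketch that semaphore-style behaviour (purely illustrative, using a Redis set of per-writer ids; this is not the approach the PR takes):

import uuid
import redis

client = redis.StrictRedis()
lock_key = b"write-lock:cache-key"
client.delete(lock_key)

def lock_for_write():
    writer_id = uuid.uuid4().bytes
    client.sadd(lock_key, writer_id)   # re-entrant: each writer adds its own id
    return writer_id

def unlock_for_write(writer_id):
    client.srem(lock_key, writer_id)

def is_write_locked():
    return client.scard(lock_key) > 0  # unlocked only when the last writer leaves

w1 = lock_for_write()
w2 = lock_for_write()
unlock_for_write(w1)
assert is_write_locked()       # w2 still holds the lock
unlock_for_write(w2)
assert not is_write_locked()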

@justinkwaugh

Thanks for doing all this work. With #653 fixed, I believe it will have correct cache consistency, which is a far cry from what it was a few weeks ago.

@chrisrossi
Contributor Author

Possibly instead of a unique lock value per thread, we could just have two lock values, one for read and one for write.

The challenge is just whether or not you want write locks to be re-entrant (by multiple threads). If so, they need to act like a semaphore to ensure that you only "unlock" once all concurrent write threads have finished.

Right. Another idea I had was, instead of deleting, to use CAS to write an "unlocked" or "deleted" value.

@chrisrossi chrisrossi closed this Jun 9, 2021
@chrisrossi
Contributor Author

Didn't mean to close.

@chrisrossi chrisrossi reopened this Jun 9, 2021
@chrisrossi chrisrossi marked this pull request as draft June 9, 2021 19:07
@justinkwaugh

justinkwaugh commented Jun 9, 2021

Right. Another idea I had was, instead of deleting, to use CAS to write an "unlocked" or "deleted" value.

That can only work if CAS is also used to set the lock value, which is a pessimistic sort of lock; otherwise one write thread can lock, and then another write thread can lock and unlock before the first is done. In reality we are not trying to prevent write threads from locking concurrently, just trying to ensure that the cache stays locked, so that read threads can't update it for the duration of any write operation / transaction.

@chrisrossi chrisrossi marked this pull request as ready for review June 15, 2021 20:31
@tseaver tseaver mentioned this pull request Jul 19, 2021
@chrisrossi
Contributor Author

How is this scenario handled:

  • Reader 1 gets a cache miss.
  • Reader 1 writes a read lock.
  • Writer writes a write lock.
  • Reader 1 reads value v1 from data-store.
  • Writer writes value v2 to data-store.
  • Writer deletes the write lock.
  • Reader 2 gets a cache miss.
  • Reader 2 writes a read lock.
  • Reader 1 writes stale v1 to the cache. CAS allows this because
    the lock value is the same.

It looks like we need read lock ids.

BTW, I think there should be tests that exercise scenarios like this
and those identified in the issues.

FTR, I thought of this scenario while writing locking documentation.
I'm a huge fan of documentation-driven development. :)

I'll start thinking about some integration tests specifically to target these scenarios. Naming them will be fun.

@jimfulton
Contributor

How is this scenario handled:
...
I'll start thinking about some integration tests specifically to target these scenarios. Naming them will be fun.

I was mainly pointing out that I think there's a bug. Do you think the scenario I mentioned is handled correctly?

@justinkwaugh

justinkwaugh commented Jul 19, 2021 via email

@justinkwaugh

justinkwaugh commented Jul 19, 2021 via email

@jimfulton
Contributor

What I should say is, even if a value that is the same is added, because it was deleted and re-added it will have a new CAS value.

I am not a Redis expert, so I'm willing to believe that at some low level, Redis assigns watch ids that can be used to detect something like this :), however, there's no such watch id floating around in the ndb Python code (or in the Python redis API AFAICT). There are just keys and values.

Here's a script that demonstrates the problem:

from google.cloud.ndb.global_cache import RedisCache
from google.cloud.ndb import _cache
from google.cloud.ndb import Client

cache = RedisCache.from_environment()
client = Client()
context = client.context(global_cache=cache)
context.__enter__()

cache_key = 'key'
cache.delete([cache_key]) # Make sure it's absent

# Read client 1
assert _cache.global_get(cache_key).result() is None
read_lock = _cache.global_lock_for_read(cache_key).result()
assert read_lock == _cache._LOCKED_FOR_READ
assert _cache.global_watch(cache_key, read_lock).result() is None
# Read b'foo' from datastore

# Write client
write_lock = _cache.global_lock_for_write(cache_key).result()
assert _cache.global_get(cache_key).result() == _cache._LOCKED_FOR_WRITE + write_lock
# write b'bar' to datastore
assert _cache.global_unlock_for_write(cache_key, write_lock).result() is None

# Read client 2
# Look ma, no lock!
assert _cache.global_get(cache_key).result() is None
read_lock2 = _cache.global_lock_for_read(cache_key).result()
assert read_lock2 == _cache._LOCKED_FOR_READ
assert _cache.global_watch(cache_key, read_lock2).result() is None

# Read client 1
_cache.global_compare_and_swap(cache_key, b'foo', expires=999).result()
assert _cache.global_get(cache_key).result() == b'foo'
# Cache now has old value

Am I missing something?

@justinkwaugh

justinkwaugh commented Jul 20, 2021 via email

@justinkwaugh

justinkwaugh commented Jul 20, 2021 via email

@chrisrossi
Contributor Author

How is this scenario handled:
...
I'll start thinking about some integration tests specifically to target these scenarios. Naming them will be fun.

I was mainly pointing out that I think there's a bug. Do you think the scenario I mentioned is handled correctly?

Sorry, I meant to respond earlier. @justinkwaugh is correct about memcache. I was going to test Redis behavior this morning, as it's not clear what it would do.

As far as tests, I was just referring back to an earlier comment you made about testing all these complex scenarios, which had been nagging me as well. Obviously, orchestrating all the different parts to test the intersection of concurrency and fault tolerance is complex, which is why it hasn't been done yet. The more of these that come up, though, the more worthwhile it seems to go ahead and bite the bullet and try to figure it out.

If we're reasonably satisfied this PR is working, though, I can always make that a separate effort under a new Issue/PR.

@jimfulton
Contributor

I am unfamiliar with redis and was looking at it from a memcache viewpoint, which does use separate values. I'm a little surprised that WATCH/EXEC would not fail the transaction based on the documentation, but if it is shown not to work then I guess not.

Even if memcache does make it possible to catch this, our global cache API, GlobalCache, doesn't allow us to leverage it.

Anyway, the fix is easy.

I still think we need to find a way to test these scenarios (as my script "tests" the bug).

@chrisrossi
Contributor Author

I am unfamiliar with redis and was looking at it from a memcache viewpoint, which does use separate values. I'm a little surprised that WATCH/EXEC would not fail the transaction based on the documentation, but if it is shown not to work then I guess not.

Even if memcache does make it possible to catch this, our global cache API, GlobalCache, doesn't allow us to leverage it.

Anyway, the fix is easy.

I still think we need to find a way to test these scenarios (as my script "tests" the bug).

I agree.

@justinkwaugh

justinkwaugh commented Jul 20, 2021 via email

@jimfulton
Contributor

...

Sorry, I meant to respond earlier. @justinkwaugh is correct about memcache. I was going to test Redis behavior this morning, as it's not clear what it would do.

See my script. I'm 97% sure it would demonstrate the same problem with memcache, as our API doesn't expose any sort of watch id (assuming that memcache has something like that).

As far as tests, I was just referring back to an earlier comment you made about testing all these complex scenarios, which had been nagging me as well. Obviously, orchestrating all the different parts to test the intersection of concurrency and fault tolerance is complex, which is why it hasn't been done yet. The more of these that come up, though, the more worthwhile it seems to go ahead and bite the bullet and try to figure it out.

This code is really complicated and the locking logic is spread out. We test bits and pieces of it in isolation and rely on reasoning for correctness. I don't really trust that approach.

It would be hard to test this at the user-facing level, because it would be too hard to control timing of calls... Although maybe not with the tests providing their own "event loop" logic.

I think my code was able to demonstrate a locking bug pretty easily. Maybe it would be possible to provide a higher-level test, using user-level calls if the tests controlled the event-loop/future evaluations. Let me know if you'd like me to take a crack at that.
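
As a very rough sketch of what that could look like (plain generators standing in for ndb tasklets and dicts standing in for the cache and Datastore, so the test decides exactly when each "client" runs; none of this is ndb's actual event loop):

def reader(cache, db):
    if cache.get("k") is None:
        cache["k"] = "read-lock"
    yield "locked"
    value = db["k"]                      # may be stale by the time we resume
    yield "read"
    if cache.get("k") == "read-lock":    # "watch": only cache if our lock survived
        cache["k"] = value
    yield "done"

def writer(cache, db):
    cache["k"] = "write-lock"
    yield "locked"
    db["k"] = "v2"
    yield "wrote"
    del cache["k"]
    yield "done"

cache, db = {}, {"k": "v1"}
r, w = reader(cache, db), writer(cache, db)
next(r); next(r)              # reader takes its lock, then reads stale "v1"
next(w); next(w); next(w)     # writer locks (overwriting), writes "v2", unlocks
next(r)                       # reader resumes: its lock is gone, so it must not cache
assert "k" not in cache, "stale value must never be written back"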

If we're reasonably satisfied this PR is working, though, I can always make that a separate effort under a new Issue/PR.

I've demonstrated a bug. I defer whether to fix it in a separate PR.

@justinkwaugh

justinkwaugh commented Jul 20, 2021 via email

@jimfulton
Contributor

Note that your test is not actually using separate clients or threads. Doing an EXEC command will automatically unwatch the keys, so a set of serial operations beyond that will not be part of the transaction.

Good point!

Modifying the script to use different redis clients:

from google.cloud.ndb.global_cache import RedisCache
from google.cloud.ndb import _cache
from google.cloud.ndb import Client

cacher1 = RedisCache.from_environment()
cacher2 = RedisCache.from_environment()
cachew = RedisCache.from_environment()

cache_key = 'key'
cacher1.delete([cache_key]) # Make sure it's absent

client = Client()

with client.context(global_cache=cacher1):
    # Read client 1
    assert _cache.global_get(cache_key).result() is None
    read_lock = _cache.global_lock_for_read(cache_key).result()
    assert read_lock == _cache._LOCKED_FOR_READ
    assert _cache.global_watch(cache_key, read_lock).result() is None
    # Read b'foo' from datastore

with client.context(global_cache=cachew):
    # Write client
    write_lock = _cache.global_lock_for_write(cache_key).result()
    assert _cache.global_get(cache_key).result() == (
        _cache._LOCKED_FOR_WRITE + write_lock)
    # write b'bar' to datastore
    assert _cache.global_unlock_for_write(cache_key, write_lock).result() is None

with client.context(global_cache=cacher2):
    # Read client 2
    # Look ma, no lock!
    assert _cache.global_get(cache_key).result() is None
    read_lock2 = _cache.global_lock_for_read(cache_key).result()
    assert read_lock2 == _cache._LOCKED_FOR_READ
    assert _cache.global_watch(cache_key, read_lock2).result() is None

with client.context(global_cache=cacher1):
    # Read client 1
    r = _cache.global_compare_and_swap(cache_key, b'foo', expires=999).result()
    assert r  # Fails because r is False

The compare and swap fails. So really, it's the client itself that keeps track of the watch state internally.

So it looks like there isn't a bug, at least for Redis.

@justinkwaugh

justinkwaugh commented Jul 20, 2021 via email

@chrisrossi
Contributor Author

What I should say is, even if a value that is the same is added, because it was deleted and re-added it will have a new CAS value.

I am not a Redis expert, so I'm willing to believe that at some low level, Redis assigns watch ids that can be used to detect something like this :), however, there's no such watch id floating around in the ndb Python code (or in the Python redis API AFAICT). There are just keys and values.

Here's a script that demonstrates the problem:
...
Am I missing something?

I think so, yes. This test doesn't really test concurrency because all of the "threads" are being run in the same context, so they're sharing the same thread-local RedisCache.pipes.

I think this demonstrates that Redis protects against concurrent writes of the same value, and the code currently does the Right Thing (tm):

import redis

client = redis.StrictRedis()
client.set(b"foo", b"boo")
assert client.get(b"foo") == b"boo"

# Normal case, no race condition
pipe = client.pipeline()
pipe.watch(b"foo")
pipe.multi()
pipe.set(b"foo", b"bar")
pipe.execute()
pipe.reset()

assert client.get(b"foo") == b"bar"

# Concurrently set the same value
pipe = client.pipeline()
pipe.watch(b"foo")
client.set(b"foo", b"baz")  # Set outside of watch pipeline
pipe.multi()
pipe.set(b"foo", b"baz")    # Set same value in watch pipeline

try:
    pipe.execute()
    assert False, "Shouldn't reach this line"

except redis.exceptions.WatchError:
    assert True, "Should raise WatchError"

finally:
    pipe.reset()

That said, I'm sympathetic to the idea that, since we have a more or less pluggable architecture here, it would be better not to make this part of the contract that cache implementers must fulfill, especially if the "fix" is relatively easy. I'm +0 on that, I suppose.

@chrisrossi
Contributor Author

And just for good measure, here it is with two pipelines, more like what would actually happen irl:

import redis

client = redis.StrictRedis()
client.set(b"foo", b"boo")
assert client.get(b"foo") == b"boo"

# Normal case, no race condition
pipe = client.pipeline()
pipe.watch(b"foo")
pipe.multi()
pipe.set(b"foo", b"bar")
pipe.execute()
pipe.reset()

assert client.get(b"foo") == b"bar"

# Concurrently set the same value
pipe1 = client.pipeline()
pipe.watch(b"foo")

pipe2 = client.pipeline()
pipe2.watch(b"foo")

pipe1.multi()
pipe1.set(b"foo", b"baz")

pipe2.multi()
pipe2.set(b"foo", b"baz")

pipe1.execute()

try:
    pipe2.execute()
    assert False, "Shouldn't reach this line"

except redis.exceptions.WatchError:
    assert True, "Should raise WatchError"

pipe1.reset()
pipe2.reset()

assert client.get(b"foo") == b"baz"

@chrisrossi
Contributor Author

See my script. I'm 97% sure it would demonstrate the same problem with memcache, as our API doesn't expose any sort of watch id (assuming that memcache has something like that).

If you look at the Memcache implementation, we get and keep track of a CAS id that comes from memcache. I'm pretty sure we don't have this problem with memcache. (Or Redis.)
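
For reference, a minimal sketch of memcache's CAS-token behaviour with pymemcache (assumes a local memcached; this is not the MemcacheCache code itself):

from pymemcache.client.base import Client

client = Client(("localhost", 11211))
client.set("k", b"lock")

value, cas = client.gets("k")   # the CAS token identifies this exact write
client.delete("k")
client.set("k", b"lock")        # same bytes, but a brand new write

# The swap fails even though the stored value is identical, because the
# delete/re-add produced a fresh CAS token.
assert client.cas("k", b"stale", cas) is False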

This code is really complicated and the locking logic is spread out. We test bits and pieces of it in isolation and rely on reasoning for correctness. I don't really trust that approach.

It would be hard to test this at the user-facing level, because it would be too hard to control timing of calls... Although maybe not with the tests providing their own "event loop" logic.

I think my code was able to demonstrate a locking bug pretty easily. Maybe it would be possible to provide a higher-level test, using user-level calls if the tests controlled the event-loop/future evaluations. Let me know if you'd like me to take a crack at that.

Well, I'm not sure that it did. Testing this properly is actually a fairly hard problem.

@chrisrossi
Contributor Author

Sorry for replying to already out-of-date comments. Yet another concurrency issue. ;-)

@chrisrossi
Contributor Author

That said, I'm sympathetic to the idea that, since we have a more or less pluggable architecture here, it would be better not to make this part of the contract that cache implementers must fulfill, especially if the "fix" is relatively easy. I'm +0 on that, I suppose.

It was easy, so why not? b77c2b9


Successfully merging this pull request may close these issues.

Cloud NDB - Get operation can blindly overwrite key lock in memcache leading to cache inconsistency