Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CRASH] Cluster Mode Cross-Slot RANDOMKEY in MULTI-EXEC #1145

Closed
nadav-levanoni opened this issue Oct 10, 2024 · 10 comments · Fixed by #1155
Closed

[CRASH] Cluster Mode Cross-Slot RANDOMKEY in MULTI-EXEC #1145

nadav-levanoni opened this issue Oct 10, 2024 · 10 comments · Fixed by #1155

Comments

@nadav-levanoni
Copy link
Contributor

Crash report

Logged crash report (pid 23580):
=== VALKEY BUG REPORT START: Cut & paste starting from here ===
23580:M 10 Oct 2024 19:36:19.133 # === ASSERTION FAILED CLIENT CONTEXT ===
23580:M 10 Oct 2024 19:36:19.133 # client->flags = 18023196754182152
23580:M 10 Oct 2024 19:36:19.133 # client->conn = fd=12
23580:M 10 Oct 2024 19:36:19.133 # client->argc = 1
23580:M 10 Oct 2024 19:36:19.133 # client->argv[0] = "RANDOMKEY" (refcount: 1)
23580:M 10 Oct 2024 19:36:19.133 # === RECURSIVE ASSERTION FAILED ===
23580:M 10 Oct 2024 19:36:19.133 # ==> db.c:246 '(int)keyHashSlot(key, (int)sdslen(key)) == server.current_client->slot' is not true

------ STACK TRACE ------

23587 bio_lazy_free
/lib64/libpthread.so.0(pthread_cond_wait+0x21c)[0xffff9c212e3c]
src/valkey-server 127.0.0.1:21116 [cluster](bioProcessBackgroundJobs+0x144)[0x4cc0f4]
/lib64/libpthread.so.0(+0x7230)[0xffff9c20c230]
/lib64/libc.so.6(+0xdb7dc)[0xffff9c15a7dc]

23586 bio_aof
/lib64/libpthread.so.0(pthread_cond_wait+0x21c)[0xffff9c212e3c]
src/valkey-server 127.0.0.1:21116 [cluster](bioProcessBackgroundJobs+0x144)[0x4cc0f4]
/lib64/libpthread.so.0(+0x7230)[0xffff9c20c230]
/lib64/libc.so.6(+0xdb7dc)[0xffff9c15a7dc]

23585 bio_close_file
/lib64/libpthread.so.0(pthread_cond_wait+0x21c)[0xffff9c212e3c]
src/valkey-server 127.0.0.1:21116 [cluster](bioProcessBackgroundJobs+0x144)[0x4cc0f4]
/lib64/libpthread.so.0(+0x7230)[0xffff9c20c230]
/lib64/libc.so.6(+0xdb7dc)[0xffff9c15a7dc]

23580 valkey-server *
src/valkey-server 127.0.0.1:21116 [cluster](getKeySlot+0x214)[0x469654]
src/valkey-server 127.0.0.1:21116 [cluster][0x46968c]
src/valkey-server 127.0.0.1:21116 [cluster](dbRandomKey+0xac)[0x470db8]
src/valkey-server 127.0.0.1:21116 [cluster](randomkeyCommand+0x18)[0x470e7c]
src/valkey-server 127.0.0.1:21116 [cluster](call+0x760)[0x5461c0]
src/valkey-server 127.0.0.1:21116 [cluster](execCommand+0x268)[0x45d4f0]
src/valkey-server 127.0.0.1:21116 [cluster](call+0x760)[0x5461c0]
src/valkey-server 127.0.0.1:21116 [cluster](processCommand+0xb7c)[0x497d1c]
src/valkey-server 127.0.0.1:21116 [cluster](processInputBuffer+0x180)[0x4b7668]
src/valkey-server 127.0.0.1:21116 [cluster](readQueryFromClient+0x50)[0x4b9c50]
src/valkey-server 127.0.0.1:21116 [cluster][0x53c6b8]
src/valkey-server 127.0.0.1:21116 [cluster](aeProcessEvents+0xdc)[0x574d80]
src/valkey-server 127.0.0.1:21116 [cluster](aeMain+0x28)[0x5753ac]
src/valkey-server 127.0.0.1:21116 [cluster](main+0x3ac)[0x45090c]
/lib64/libc.so.6(__libc_start_main+0xe4)[0xffff9c09eda4]
src/valkey-server 127.0.0.1:21116 [cluster][0x450ff0]


### Additional information

```shell
Fails the following test

start_cluster 1 0 {tags {external:skip cluster}} {
    test "Regression test for multi-exec with RANDOMKEY accessing the wrong per-slot dicitonary" {
        R 0 SETEX FOO 10000 BAR
        R 0 SETEX FIZZ 10000 BUZZ

        R 0 MULTI
        R 0 GET FOO
        R 0 RANDOMKEY
        set result [R 0 EXEC]
        assert_equal $result {{} FOO}
    }
}
@nadav-levanoni
Copy link
Contributor Author

Will raise a PR soon to fix this issue.

@madolson
Copy link
Member

Just to mention, this won't crash for most production users since it uses the debug assert.

@nadav-levanoni
Copy link
Contributor Author

Not in the scope of this issue, but if we refactor the expiry API to optionally accept a slot (or create a new API), we can potentially use the slot cached in the client if a slot is not provided. Would need to take a closer look to see if that's possible. It would improve the look-up perf by a decent amount.

@madolson
Copy link
Member

The behavior today is that we use the slot on the client as an optimization. The assertion we are hitting is that there was a mismatch between the cached slot and the real slot, which is only checked when debug assertions are enabled.

@enjoy-binbin
Copy link
Member

@madolson in this case, it look like we are returning the wrong slot, like we are returning the cached slot (current->client->slot) and not the real slot (do nothing with the debug assert)

@zuiderkwast
Copy link
Contributor

We need dbRandomKey to bypass the optimization in getKeySlot in some way. A simple hack would be to set server.current_client->slot = -1 temporarily and restore it afterwards.

@nadav-levanoni What's the fix you have in mind?

@madolson
Copy link
Member

madolson commented Oct 11, 2024

@madolson in this case, it look like we are returning the wrong slot, like we are returning the cached slot (current->client->slot) and not the real slot (do nothing with the debug assert)

We have a debug assert that validates the cached slot is the same as the real slot. That caused the crash that Nadav is referring to. In this specific case, we are checking the wrong slot for an expire in randomKey, which is not ideal and is a correctness issue. (We might return a key that is logically expired)

@nadav-levanoni
Copy link
Contributor Author

I've been looking more into this and I think the most correct solution would also come with a perf improvement (spoke with Dan about this). I'm still toying around with it locally, but I think that reworking most of the expiry API to have WithSlot variants would prevent us from unnecessarily recalculating the slot (getKeySlot) when the caller already has correct slot in its context.

ex. keys command can call keyIsExpiredWithSlot
https://github.com/valkey-io/valkey/blob/unstable/src/db.c#L841

in the case of the randomkey command specifically we can call functions dbFindExpiresWithSlot and expireIfNeededWithSlot. This would entirely avoid the call to getKeySlot.

@zuiderkwast
Copy link
Contributor

Does this bug apply also to other commands like KEYS and SCAN called within a transaction in the same way? I guess it does...

I think the most correct solution would also come with a perf improvement (spoke with Dan about this). I'm still toying around with it locally, but I think that reworking most of the expiry API to have WithSlot variants would prevent us from unnecessarily recalculating the slot (getKeySlot) when the caller already has correct slot in its context.

@nadav-levanoni In theory, I agree. I prefer a two-step approach in this case though:

  1. Do a simple fix. We can merge it fast and it can be backported to 7.2.x and 8.0.x.
  2. Do the refactoring, which may need more design and review and can go into 8.1.0.

On top of this, we have some other works in progress about restructuring how key-value objects and TTLs are stored, so this API may already change in the near future.'

@nadav-levanoni
Copy link
Contributor Author

nadav-levanoni commented Oct 11, 2024

This is my first PR, so I'm a little slow
#1155

Does this bug apply also to other commands like KEYS and SCAN called within a transaction in the same way?
AFAIK, any command that grabs a "random" key (not specified in the argument vector & not bound to a slot) is susceptible. I have not tried forcing the crash with the KEYS or SCAN command, but I think it may crash them as well.

@madolson madolson linked a pull request Oct 17, 2024 that will close this issue
madolson added a commit that referenced this issue Oct 17, 2024
#1145

First part of a two-step effort to add `WithSlot` API for expiry. This
PR is to fix a crash that occurs when a RANDOMKEY uses a different slot
than the cached slot of a client during a multi-exec.

The next part will be to utilize the new API as an optimization to
prevent duplicate work when calculating the slot for a key.

---------

Signed-off-by: Nadav Levanoni <[email protected]>
Signed-off-by: Madelyn Olson <[email protected]>
Co-authored-by: Nadav Levanoni <[email protected]>
Co-authored-by: Madelyn Olson <[email protected]>
eifrah-aws pushed a commit to eifrah-aws/valkey that referenced this issue Oct 20, 2024
…o#1155)

valkey-io#1145

First part of a two-step effort to add `WithSlot` API for expiry. This
PR is to fix a crash that occurs when a RANDOMKEY uses a different slot
than the cached slot of a client during a multi-exec.

The next part will be to utilize the new API as an optimization to
prevent duplicate work when calculating the slot for a key.

---------

Signed-off-by: Nadav Levanoni <[email protected]>
Signed-off-by: Madelyn Olson <[email protected]>
Co-authored-by: Nadav Levanoni <[email protected]>
Co-authored-by: Madelyn Olson <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants