Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Fix CLUSTER SHARDS crash in 7.0/7.2 mixed clusters where shard ids ar…
…e not sync Crash reported in redis#12695. In the process of upgrading the cluster from 7.0 to 7.2, because the 7.0 nodes will not gossip shard id, in 7.2 we will rely on shard id to build the server.cluster->shards dict. In some cases, for example, the 7.0 master node and the 7.2 replica node. From the view of 7.2 replica node, the cluster->shards dictionary does not have its master node. In this case calling CLUSTER SHARDS on the 7.2 replica node may crash. A CLUSTER SHARDS result output: ``` 1) 1) "slots" 2) 1) (integer) 0 2) (integer) 5461 3) (integer) 0 4) (integer) 5461 5) (integer) 0 ``` We can see that the output contains repeated slots, and each call will append a new one, and then crash on serverAssert: ```c void clusterGenNodesSlotsInfo(int filter) { ... /* Generate slots info when occur different node with start * or end of slot. */ if (i == CLUSTER_SLOTS || n != server.cluster->slots[i]) { if (!(n->flags & filter)) { if (!n->slot_info_pairs) { n->slot_info_pairs = zmalloc(2 * n->numslots * sizeof(uint16_t)); } serverAssert((n->slot_info_pairs_count + 1) < (2 * n->numslots)); n->slot_info_pairs[n->slot_info_pairs_count++] = start; n->slot_info_pairs[n->slot_info_pairs_count++] = i-1; } if (i == CLUSTER_SLOTS) break; n = server.cluster->slots[i]; start = i; } ... } ``` The reason is that in addShardReplyForClusterShards we are not able to clean up the slot_info_pairs corresponding to the 7.0 master node. In the code below, we will loop to find the 7.0 master node, and then we will call clusterFreeNodesSlotsInfo to clean up slot_info_pairs according to the shard id dict list, but the 7.0 master node is not in the list. ```c void addShardReplyForClusterShards(client *c, list *nodes) { ... /* Use slot_info_pairs from the primary only */ while (n->slaveof != NULL) n = n->slaveof; ... addReplyBulkCString(c, "nodes"); addReplyArrayLen(c, listLength(nodes)); listIter li; listRewind(nodes, &li); for (listNode *ln = listNext(&li); ln != NULL; ln = listNext(&li)) { clusterNode *n = listNodeValue(ln); addNodeDetailsToShardReply(c, n); clusterFreeNodesSlotsInfo(n); } } ``` We should fix the underlying assumption of updateShardId, which is that the shard dict should be always in sync with the node's shard_id. The fix was suggested by PingXie, see more details in redis#12695. Co-authored-by: Ping Xie <[email protected]>
- Loading branch information