From 30021578dd256e2cb7029a26ca25689581e1f933 Mon Sep 17 00:00:00 2001
From: Binbin <binloveplay1314@qq.com>
Date: Tue, 28 Nov 2023 19:50:56 +0800
Subject: [PATCH] Use CLZ in _dictNextExp to get the next power of two

In the past, we did not call _dictNextExp frequently. It was only
called when the dictionary was expanded.

Later, dictTypeExpandAllowed was introduced in #7954, which is 6.2.
For the data dict and the expire dict, we can check maxmemory before
actually expanding the dict. This is a good optimization to avoid
maxmemory being exceeded due to the dict expansion.

And in #11692, we moved the dictTypeExpandAllowed check before the
threshold check, this caused a bit of performance degradation, every
time a key is added to the dict, dictTypeExpandAllowed is called to check.

The main reason for degradation is that in a large dict, we need to
call _dictNextExp frequently, that is, every time we add a key, we
need to call _dictNextExp once. Then the threshold is checked to see
if the dict needs to be expanded. We can see that the order of checks
here can be optimized.

So we moved the dictTypeExpandAllowed check back to after the threshold
check in #12789. In this way, before the dict is actually expanded (that
is, before the threshold is reached), we will not do anything extra
compared to before, that is, we will not call _dictNextExp frequently.

But note we'll still hit the degradation when we over the thresholds.
When the threshold is reached, because #7954, we may delay the dict
expansion due to maxmemory limitations. In this case, we will call
_dictNextExp every time we add a key during this period.

This PR use CLZ in _dictNextExp to get the next power of two. CLZ (count
leading zeros) can easily give you the next power of two. It should be
noted that we have actually introduced the use of __builtin_clzl in #8687,
which is 7.0. So i suppose all the platforms we use have it (even if the
CPU doesn't have an instruction).

We build 67108864 (2**26) keys through DEBUG POPULTE, which will use
approximately 5.49G memory (used_memory:5898522936). If expansion is
triggered, the additional hash table will consume approximately 1G
memory (2 ** 27 * 8). So we set maxmemory to 6871947673 (that is, 6.4G),
which will be less than 5.49G + 1G, so we will delay the dict rehash
while addint the keys.

After that, each time an element is added to the dict, an allow check
will be performed, that is, we can frequently call _dictNextExp to test
the comparison before and after the optimization. Using DEBUG HTSTATS 0 to
check and make sure that our dict expansion is dealyed.

Using `./src/redis-benchmark -P 100 -r 1000000000 -t set -n 5000000`,
After ten rounds of testing:
```
unstable:           this PR:
769585.94           816860.00
771724.00           818196.69
775674.81           822368.44
781983.12           822503.69
783576.25           828088.75
784190.75           828637.75
791389.69           829875.50
794659.94           835660.69
798212.00           830013.25
801153.62           833934.56
```

We can see there is about 4-5% performance improvement in this case.
---
 src/dict.c | 11 +++--------
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/src/dict.c b/src/dict.c
index 79988d7681c..66a9bf951cd 100644
--- a/src/dict.c
+++ b/src/dict.c
@@ -1422,18 +1422,13 @@ static int _dictExpandIfNeeded(dict *d)
     return DICT_OK;
 }
 
-/* TODO: clz optimization */
 /* Our hash table capability is a power of two */
 static signed char _dictNextExp(unsigned long size)
 {
-    unsigned char e = DICT_HT_INITIAL_EXP;
-
+    if (size <= DICT_HT_INITIAL_SIZE) return DICT_HT_INITIAL_EXP;
     if (size >= LONG_MAX) return (8*sizeof(long)-1);
-    while(1) {
-        if (((unsigned long)1<<e) >= size)
-            return e;
-        e++;
-    }
+
+    return 8*sizeof(long) - __builtin_clzl(size-1);
 }
 
 /* Finds and returns the position within the dict where the provided key should