RCORE-2157 Avoid to decompress Strings while sorting them. Instead use fast comparison provided by the interner #7892

nicola-cab · 2024-07-16T16:54:29Z

What, How & Why?

This PR is about comparing the compressed string views via their string ids, while sorting results for queries.
Right now, we are decompressing strings and then sorting, which is more expensive.

…ore into nc/RCORE-2157

coveralls-official · 2024-07-18T12:11:34Z

Pull Request Test Coverage Report for Build nicola.cabiddu_1882

Details

384 of 394 (97.46%) changed or added relevant lines in 8 files are covered.
40 unchanged lines in 9 files lost coverage.
Overall coverage increased (+0.009%) to 90.966%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
src/realm/obj.cpp	18	19	94.74%
src/realm/obj.hpp	0	3	0.0%
src/realm/sort_descriptor.hpp	6	12	50.0%

Files with Coverage Reduction	New Missed Lines	%
src/realm/dictionary.cpp	1	85.52%
src/realm/index_string.hpp	1	93.48%
test/test_index_string.cpp	1	93.33%
src/realm/sync/noinst/client_impl_base.cpp	2	82.27%
src/realm/sync/noinst/server/server.cpp	2	73.49%
test/object-store/sync/client_reset.cpp	2	99.0%
src/realm/bplustree.cpp	6	73.08%
src/realm/index_string.cpp	8	84.63%
test/fuzz_group.cpp	17	45.44%

Totals
Change from base Build jorgen.edelbo_372:	0.009%
Covered Lines:	219716
Relevant Lines:	241536

💛 - Coveralls

src/realm/path.hpp

src/realm/sort_descriptor.cpp

src/realm/sort_descriptor.hpp

src/realm/sort_descriptor.cpp

ironage · 2024-07-18T18:30:55Z

src/realm/sort_descriptor.cpp

@@ -434,20 +476,33 @@ bool BaseDescriptor::Sorter::operator()(IndexPair i, IndexPair j, bool total_ord
                const auto& obj = m_columns[t].table->get_object(key_i);
                const auto& col_key = m_columns[t].col_key;

-                cache_i.value = col_key.get_value(obj);
+                // store stringID instead of the actual string if possible
+                cache_i.cached_string_id = col_key.get_string_id(obj);


This will add unnecessary overhead per comparison when sorting on a non-string column. Ideally instead of doing the check inside the loop, we would have two separate specialized loops to keep other types fast. This might be complex to refactor though because there are already multiple optimizations piled in here.
ie. instead of this pattern:

for (..) { if interned string do special comparison else do value comparison } }

we should do something like this:

if column supports string interning for (...) { do special intern comparison if possible } } else { for (...) { do normal value comparison } }

I think we should have some benchmarks on sorting integers that can back up my claim, I wonder how much this change costs them?

The cost of fetching the String ID for a non-mixed/non-string properties is only the cost of checking that the property type may have a string interner associated (check on the column type) + an extra check for the optional type.
Also in the code, we iterate column by column, so the check per col type can only be perform when we are checking the column itself. I can try to run the benchmarks and see..

Yep, I think statically it does not make a huge difference, probably we are going to slow down a little bit sorting for all properties that are not strings or mixed, but, it is really difficult to gauge and also doesn't it depend on also by the data we have?

Req runs: 6 SortInt (MemOnly, EncryptionOff): min 57.16ms (+3.73%) max 76.90ms (+30.61%) med 66.13ms (+17.93%) avg 66.87ms (+18.40%) stddev 8.98ms (+586.99%) Req runs: 5 SortIntList (MemOnly, EncryptionOff): min 321.11ms (+15.11%) max 397.76ms (+27.72%) med 333.63ms (+9.98%) avg 353.77ms (+18.18%) stddev 36.49ms (+194.74%) Req runs: 10 SortIntDictionary (MemOnly, EncryptionOff): min 37.68ms (-0.91%) max 56ms (+7.95%) med 40.05ms (-5.75%) avg 42.62ms (-1.18%) stddev 6.44ms (+38.90%) Req runs: 225 SortThenLimit (MemOnly, EncryptionOff): min 1.89ms (+14.04%) max 3.82ms (+44.19%) med 2.09ms (+21.53%) avg 2.17ms (+23.60%) stddev 284us (+144.66%)

Another run (same ref data as starting cmp input)

Req runs: 7 SortInt (MemOnly, EncryptionOff): min 56.62ms (+2.75%) max 64.11ms (+8.90%) med 57.26ms (+2.11%) avg 58.38ms (+3.36%) stddev 2.67ms (+104.06%) Req runs: 5 SortIntList (MemOnly, EncryptionOff): min 298.46ms (+6.99%) max 332.43ms (+6.74%) med 304.79ms (+0.47%) avg 309.70ms (+3.46%) stddev 13.23ms (+6.88%) Req runs: 11 SortIntDictionary (MemOnly, EncryptionOff): min 36.85ms (-3.08%) max 39.14ms (-24.55%) med 37.59ms (-11.52%) avg 37.72ms (-12.54%) stddev 629us (-86.44%) Req runs: 241 SortThenLimit (MemOnly, EncryptionOff): min 1.87ms (+12.72%) max 3.40ms (+28.36%) med 1.98ms (+14.96%) avg 2.06ms (+17.05%) stddev 228us (+95.75%)

This looks like a pretty big perf regression to me?

There was a bug in my code, I was constantly passing by copy the Extended Column, these are the right perfs:

Req runs: 8 SortInt (MemOnly, EncryptionOff): min 58.66ms (+6.49%) max 66.55ms (+11.33%) med 60.24ms (+4.87%) avg 61.30ms (+6.42%) stddev 3.06ms (+70.90%) Req runs: 5 SortIntList (MemOnly, EncryptionOff): min 290.56ms (+3.83%) max 304.51ms (-1.68%) med 302.37ms (+3.53%) avg 299.28ms (+1.51%) stddev 6.12ms (-45.79%) Req runs: 10 SortIntDictionary (MemOnly, EncryptionOff): min 37.89ms (-3.70%) max 45.80ms (-12.86%) med 41.27ms (-0.43%) avg 41.59ms (-3.61%) stddev 2.53ms (-36.80%) Req runs: 231 SortThenLimit (MemOnly, EncryptionOff): min 1.84ms (+10.23%) max 2.93ms (-24.94%) med 1.95ms (+6.03%) avg 1.98ms (+1.44%) stddev 144us (-58.33%)

I am still failing to see why this is a massive problem for non-mixed or non-string columns. It does not seem like it is to me, but I must be wrong. Because you are both complaining about this :-)

I agree that Mixed and Strings are going to be impacted, though. Especially Mixed, because a mixed can hold anything in it…

Yep, I tried BenchmarkSort already. And it is slower. But at the same time it is not the best benchmark either IMO.

Req runs: 5 Sort (MemOnly, EncryptionOff): * min 207.91ms (+72.16%) max 214.60ms (+48.24%) * med 210.27ms (+62.98%) * avg 211.07ms (+60.69%) stddev 2.99ms (-65.96%)

Mainly for these 3 reasons:

The internal DS that the string interner is keeping is not lock free, we are constantly grabbing a lock. Especially for sorting stuff, it is a real problem (since we may be calling the sorting operator a lot) . Jira to cover this: (https://jira.mongodb.org/browse/RCORE-2214)

We are constantly loading a lot of data when we need to access the compressed string leaves in the Trie we built, we need to improve the internal DS: (https://jira.mongodb.org/browse/RCORE-2125)

BenchmarkSort is writing 200k randomly generated strings which are representing some number. This is very much the worse scenario for interning stuff. Also, in the feature branch we are not limiting compression of strings like we do for integers (where we check if there is a sizeable gain before compressing). We just compress all (I suspect we will introduce something).

I've added a benchmark for testing the sorting of strings that are in compressed format (as you suggested). Passing 2 parameters to the test:

N length of the strings to store

M the number of duplicated strings

You can see that already, with all these limitations, there is a point after that it pays off to sort using the compressed string ID.

Req runs: 5 SortCompressedSmall(10,500) (MemOnly, EncryptionOff): * min 124.38ms (+30.21%) max 143.33ms (+17.40%) * med 138.98ms (+27.34%) * avg 136.07ms (+24.38%) stddev 7.89ms (-27.27%) Req runs: 5 SortCompressedMedium(50,500) (MemOnly, EncryptionOff): * min 125.46ms (+11.46%) max 131.65ms (+9.01%) * med 128.17ms (+12.81%) * avg 128.03ms (+11.23%) stddev 2.47ms (-25.63%) Req runs: 5 SortCompressedLarge(100,500) (MemOnly, EncryptionOff): * min 128.91ms (+16.33%) max 145.32ms (+26.75%) * med 131.26ms (+17.72%) * avg 134.01ms (+19.24%) stddev 6.69ms (+302.40%) Req runs: 5 SortCompressedLarge(1000,5000) (MemOnly, EncryptionOff): min 196.71ms (-30.88%) max 208.25ms (-36.51%) med 202.42ms (-29.76%) avg 202.50ms (-31.61%) stddev 4.09ms (-77.60%)

For the non-string benchmarks, we appear to be less than 10% worse which I think seems reasonable. Ideally, that code would not be affected at all, and there may be a way to do this by templating out the non-string sorting, but I'm not sure it is worth the duplication that will create.

For the BenchmarkSort case, I don't think we should ship this feature until we get that regression down by quite a bit. Can you create a ticket for the third item you suggested so that we don't forget about it? As long as we are tracking those 3 approaches and get them and recheck the benchmarks later, it shouldn't block this PR now.

Yep I agree.

There is a bit of tech debt in the sorting logic. I tried to make the code templated based on column type, but I quit, because the code was basically duplicated, also we need to extract the column type in at least 2 cases. First one when we set the index pair and the other one when we are sorting the index pairs. I concluded it was too much hassle. But maybe there is a better way...

Yes I agree with you for string benchmarks, in fact all the code is not going into next major. We do have already a jira for this (https://jira.mongodb.org/browse/RCORE-2125). I will drop a comment in order to track this.

in fact all the code is not going into next major

What do you mean by this?

I meant: that the code is not finished, we need to work on the optimizations I mentioned. The feature branch is just our base. Once we complete all the features, it will of course go into next major, like we did for integers.

src/realm/sort_descriptor.cpp

test/test_query.cpp

ironage · 2024-07-30T21:45:25Z

src/realm/sort_descriptor.cpp

+            return -interner->compare(m_j.get_string(), (StringID)m_i.get_int());
+    }
+
+    // 4. compare string vs any other non-string (this is going to be slow)


if one side is a string and the other isn't then we don't actually need to fetch the string value because the Mixed comparison is only about the type ordering, not the values. So you could speed this up by just using a constant empty string.

ironage · 2024-07-30T21:45:33Z

test/test_query.cpp

+    return distribution(generator);
+}
+
+static std::string gen_random_string(size_t length)


Good idea to test random strings. I'd love to see some utf8 strings tested as well. My suggestion would be to look at the tests such as UTF8_Compare_Strings (and the others nearby) where you could modify the str_compare lambda to use compression as well and verify that both modes produce the same sort order.

ironage · 2024-07-30T21:47:33Z

Could you add a benchmark for sorting compressed strings? The sorting logic is getting very messy and is due to a clean up, and having a benchmark in place for when that happens would be very helpful.

ironage · 2024-07-31T17:08:11Z

test/test_utf8.cpp

@@ -169,6 +171,45 @@ TEST(UTF8_Compare_Core_utf8)
    CHECK_EQUAL(false, str_compare(u16sur2, u16sur2));
 }

+TEST(UTF8_Compare_Core_utf8_interned_strings)


It would be better to not make a copy of the tests. I'd suggest modifying the existing tests to do the interned sort inside of the str_compare lambda and check that it is the same result as the non-interned comparison. The benefit of doing it that way, is that we can easily add more values to both cases, and we can directly see if there are any differences in the comparisons rather than doing a diff across tests.

ironage · 2024-07-31T17:17:50Z