-
Notifications
You must be signed in to change notification settings - Fork 165
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RCORE-2157 Avoid to decompress Strings while sorting them. Instead use fast comparison provided by the interner #7892
Conversation
Pull Request Test Coverage Report for Build nicola.cabiddu_1882Details
💛 - Coveralls |
4dd9740
to
87b3d1a
Compare
src/realm/sort_descriptor.cpp
Outdated
@@ -434,20 +476,33 @@ bool BaseDescriptor::Sorter::operator()(IndexPair i, IndexPair j, bool total_ord | |||
const auto& obj = m_columns[t].table->get_object(key_i); | |||
const auto& col_key = m_columns[t].col_key; | |||
|
|||
cache_i.value = col_key.get_value(obj); | |||
// store stringID instead of the actual string if possible | |||
cache_i.cached_string_id = col_key.get_string_id(obj); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will add unnecessary overhead per comparison when sorting on a non-string column. Ideally instead of doing the check inside the loop, we would have two separate specialized loops to keep other types fast. This might be complex to refactor though because there are already multiple optimizations piled in here.
ie. instead of this pattern:
for (..) {
if interned string
do special comparison
else
do value comparison
}
}
we should do something like this:
if column supports string interning
for (...) {
do special intern comparison if possible
}
} else {
for (...) {
do normal value comparison
}
}
I think we should have some benchmarks on sorting integers that can back up my claim, I wonder how much this change costs them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cost of fetching the String ID for a non-mixed/non-string properties is only the cost of checking that the property type may have a string interner associated (check on the column type) + an extra check for the optional type.
Also in the code, we iterate column by column, so the check per col type can only be perform when we are checking the column itself. I can try to run the benchmarks and see..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I think statically it does not make a huge difference, probably we are going to slow down a little bit sorting for all properties that are not strings or mixed, but, it is really difficult to gauge and also doesn't it depend on also by the data we have?
Req runs: 6 SortInt (MemOnly, EncryptionOff): min 57.16ms (+3.73%) max 76.90ms (+30.61%) med 66.13ms (+17.93%) avg 66.87ms (+18.40%) stddev 8.98ms (+586.99%)
Req runs: 5 SortIntList (MemOnly, EncryptionOff): min 321.11ms (+15.11%) max 397.76ms (+27.72%) med 333.63ms (+9.98%) avg 353.77ms (+18.18%) stddev 36.49ms (+194.74%)
Req runs: 10 SortIntDictionary (MemOnly, EncryptionOff): min 37.68ms (-0.91%) max 56ms (+7.95%) med 40.05ms (-5.75%) avg 42.62ms (-1.18%) stddev 6.44ms (+38.90%)
Req runs: 225 SortThenLimit (MemOnly, EncryptionOff): min 1.89ms (+14.04%) max 3.82ms (+44.19%) med 2.09ms (+21.53%) avg 2.17ms (+23.60%) stddev 284us (+144.66%)
Another run (same ref data as starting cmp input)
Req runs: 7 SortInt (MemOnly, EncryptionOff): min 56.62ms (+2.75%) max 64.11ms (+8.90%) med 57.26ms (+2.11%) avg 58.38ms (+3.36%) stddev 2.67ms (+104.06%)
Req runs: 5 SortIntList (MemOnly, EncryptionOff): min 298.46ms (+6.99%) max 332.43ms (+6.74%) med 304.79ms (+0.47%) avg 309.70ms (+3.46%) stddev 13.23ms (+6.88%)
Req runs: 11 SortIntDictionary (MemOnly, EncryptionOff): min 36.85ms (-3.08%) max 39.14ms (-24.55%) med 37.59ms (-11.52%) avg 37.72ms (-12.54%) stddev 629us (-86.44%)
Req runs: 241 SortThenLimit (MemOnly, EncryptionOff): min 1.87ms (+12.72%) max 3.40ms (+28.36%) med 1.98ms (+14.96%) avg 2.06ms (+17.05%) stddev 228us (+95.75%)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a pretty big perf regression to me?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a bug in my code, I was constantly passing by copy the Extended Column, these are the right perfs:
Req runs: 8 SortInt (MemOnly, EncryptionOff): min 58.66ms (+6.49%) max 66.55ms (+11.33%) med 60.24ms (+4.87%) avg 61.30ms (+6.42%) stddev 3.06ms (+70.90%)
Req runs: 5 SortIntList (MemOnly, EncryptionOff): min 290.56ms (+3.83%) max 304.51ms (-1.68%) med 302.37ms (+3.53%) avg 299.28ms (+1.51%) stddev 6.12ms (-45.79%)
Req runs: 10 SortIntDictionary (MemOnly, EncryptionOff): min 37.89ms (-3.70%) max 45.80ms (-12.86%) med 41.27ms (-0.43%) avg 41.59ms (-3.61%) stddev 2.53ms (-36.80%)
Req runs: 231 SortThenLimit (MemOnly, EncryptionOff): min 1.84ms (+10.23%) max 2.93ms (-24.94%) med 1.95ms (+6.03%) avg 1.98ms (+1.44%) stddev 144us (-58.33%)
I am still failing to see why this is a massive problem for non-mixed or non-string columns. It does not seem like it is to me, but I must be wrong. Because you are both complaining about this :-)
I agree that Mixed and Strings are going to be impacted, though. Especially Mixed, because a mixed can hold anything in it…
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I tried BenchmarkSort
already. And it is slower. But at the same time it is not the best benchmark either IMO.
Req runs: 5 Sort (MemOnly, EncryptionOff): * min 207.91ms (+72.16%) max 214.60ms (+48.24%) * med 210.27ms (+62.98%) * avg 211.07ms (+60.69%) stddev 2.99ms (-65.96%)
Mainly for these 3 reasons:
-
The internal DS that the string interner is keeping is not lock free, we are constantly grabbing a lock. Especially for sorting stuff, it is a real problem (since we may be calling the sorting operator a lot) . Jira to cover this: (https://jira.mongodb.org/browse/RCORE-2214)
-
We are constantly loading a lot of data when we need to access the compressed string leaves in the Trie we built, we need to improve the internal DS: (https://jira.mongodb.org/browse/RCORE-2125)
-
BenchmarkSort
is writing 200k randomly generated strings which are representing some number. This is very much the worse scenario for interning stuff. Also, in the feature branch we are not limiting compression of strings like we do for integers (where we check if there is a sizeable gain before compressing). We just compress all (I suspect we will introduce something).
I've added a benchmark for testing the sorting of strings that are in compressed format (as you suggested). Passing 2 parameters to the test:
- N length of the strings to store
- M the number of duplicated strings
You can see that already, with all these limitations, there is a point after that it pays off to sort using the compressed string ID.
Req runs: 5 SortCompressedSmall(10,500) (MemOnly, EncryptionOff): * min 124.38ms (+30.21%) max 143.33ms (+17.40%) * med 138.98ms (+27.34%) * avg 136.07ms (+24.38%) stddev 7.89ms (-27.27%)
Req runs: 5 SortCompressedMedium(50,500) (MemOnly, EncryptionOff): * min 125.46ms (+11.46%) max 131.65ms (+9.01%) * med 128.17ms (+12.81%) * avg 128.03ms (+11.23%) stddev 2.47ms (-25.63%)
Req runs: 5 SortCompressedLarge(100,500) (MemOnly, EncryptionOff): * min 128.91ms (+16.33%) max 145.32ms (+26.75%) * med 131.26ms (+17.72%) * avg 134.01ms (+19.24%) stddev 6.69ms (+302.40%)
Req runs: 5 SortCompressedLarge(1000,5000) (MemOnly, EncryptionOff): min 196.71ms (-30.88%) max 208.25ms (-36.51%) med 202.42ms (-29.76%) avg 202.50ms (-31.61%) stddev 4.09ms (-77.60%)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the non-string benchmarks, we appear to be less than 10% worse which I think seems reasonable. Ideally, that code would not be affected at all, and there may be a way to do this by templating out the non-string sorting, but I'm not sure it is worth the duplication that will create.
For the BenchmarkSort case, I don't think we should ship this feature until we get that regression down by quite a bit. Can you create a ticket for the third item you suggested so that we don't forget about it? As long as we are tracking those 3 approaches and get them and recheck the benchmarks later, it shouldn't block this PR now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep I agree.
-
There is a bit of tech debt in the sorting logic. I tried to make the code templated based on column type, but I quit, because the code was basically duplicated, also we need to extract the column type in at least 2 cases. First one when we set the index pair and the other one when we are sorting the index pairs. I concluded it was too much hassle. But maybe there is a better way...
-
Yes I agree with you for string benchmarks, in fact all the code is not going into next major. We do have already a jira for this (https://jira.mongodb.org/browse/RCORE-2125). I will drop a comment in order to track this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in fact all the code is not going into next major
What do you mean by this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant: that the code is not finished, we need to work on the optimizations I mentioned. The feature branch is just our base. Once we complete all the features, it will of course go into next major, like we did for integers.
88f9194
to
e22000b
Compare
344458c
to
93f128c
Compare
93f128c
to
3d6feb6
Compare
0f8e791
to
5841658
Compare
5841658
to
29b04b3
Compare
src/realm/sort_descriptor.cpp
Outdated
return -interner->compare(m_j.get_string(), (StringID)m_i.get_int()); | ||
} | ||
|
||
// 4. compare string vs any other non-string (this is going to be slow) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if one side is a string and the other isn't then we don't actually need to fetch the string value because the Mixed comparison is only about the type ordering, not the values. So you could speed this up by just using a constant empty string.
return distribution(generator); | ||
} | ||
|
||
static std::string gen_random_string(size_t length) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea to test random strings. I'd love to see some utf8 strings tested as well. My suggestion would be to look at the tests such as UTF8_Compare_Strings
(and the others nearby) where you could modify the str_compare
lambda to use compression as well and verify that both modes produce the same sort order.
Could you add a benchmark for sorting compressed strings? The sorting logic is getting very messy and is due to a clean up, and having a benchmark in place for when that happens would be very helpful. |
6ad7b4f
to
8b68853
Compare
test/test_utf8.cpp
Outdated
@@ -169,6 +171,45 @@ TEST(UTF8_Compare_Core_utf8) | |||
CHECK_EQUAL(false, str_compare(u16sur2, u16sur2)); | |||
} | |||
|
|||
TEST(UTF8_Compare_Core_utf8_interned_strings) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be better to not make a copy of the tests. I'd suggest modifying the existing tests to do the interned sort inside of the str_compare lambda and check that it is the same result as the non-interned comparison. The benefit of doing it that way, is that we can easily add more values to both cases, and we can directly see if there are any differences in the comparisons rather than doing a diff across tests.
test/benchmark-common-tasks/main.cpp
Outdated
// M is the numer of times we want store the string (number of dups) | ||
template <size_t N, size_t M> | ||
struct BenchmarkSortCompressed : BenchmarkWithStrings { | ||
mutable std::string _name; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this mutable? the leading underscore is also unusual for our naming conventions.
test/benchmark-common-tasks/main.cpp
Outdated
@@ -1630,6 +1630,72 @@ struct BenchmarkSort : BenchmarkWithStrings { | |||
} | |||
}; | |||
|
|||
// benchmark for testing compressed string sorting. | |||
// N is the size of the string to generate | |||
// M is the numer of times we want store the string (number of dups) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// M is the numer of times we want store the string (number of dups) | |
// M is the number of times we want store the string (number of dups) |
test/benchmark-common-tasks/main.cpp
Outdated
: BenchmarkWithStrings() | ||
{ | ||
if constexpr (N <= 15) { | ||
_name = util::format("SortCompressedSmall(%1,%2)", std::to_string(N), std::to_string(M)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the std::to_string(N)
necessary? I would expect util::format
to do the right thing with a constant number.
|
||
void before_all(DBRef group) | ||
{ | ||
BenchmarkWithStringsTable::before_all(group); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this adds a bunch of strings to the table, do you really want those in there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no.. removed.
test/benchmark-common-tasks/main.cpp
Outdated
|
||
Obj obj = t->create_object(); | ||
obj.set<StringData>(m_col, str); | ||
m_keys.push_back(obj.get_key()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
m_keys appears to be unused?
@@ -2623,10 +2623,6 @@ TEST(Test_Commit_Compression_Strings) | |||
auto set_s = obj.get_set<String>(col_key_set_string); | |||
auto dictionary_s = obj.get_dictionary(col_key_dict_string); | |||
|
|||
CHECK_EQUAL(list_s.size(), i + 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not related to this change, but in one of the N runs I happened to hit a generation of the same string twice. Checking the size is not smart, It is OK to just fetch the value.
test/test_utf8.cpp
Outdated
|
||
parent.destroy_deep(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This works, but I'd suggest using a DeepArrayDestroyGuard
on the parent array at the top instead.
test/test_utf8.cpp
Outdated
auto id1 = interner.lookup(a); | ||
auto id2 = interner.lookup(b); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could intern the strings on the fly, if they don't exist already. Then you don't need to worry about adding them manually. That should reduce the code duplication and make it easier to add more strings in the future.
With this pattern, you should be able to easily adapt the UTF8_Compare_Strings
test as well.
4e42307
to
1629181
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM for now. The discussion here assures that this feature won't be shipped without another analysis of the performance numbers.
What, How & Why?
This PR is about comparing the compressed string views via their string ids, while sorting results for queries.
Right now, we are decompressing strings and then sorting, which is more expensive.