Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RCORE-2157 Avoid to decompress Strings while sorting them. Instead use fast comparison provided by the interner #7892
RCORE-2157 Avoid to decompress Strings while sorting them. Instead use fast comparison provided by the interner #7892
Changes from 14 commits
2d58549
8089f9f
2142660
34f8bd3
865f599
d94c474
87b3d1a
6434ceb
7d62244
e22000b
f3f0c76
2c706f0
fd61aee
39adf24
5caa311
0c59fb5
4d9c3d1
3d6feb6
f3d4fb5
29b04b3
ee2cbec
8b68853
195716c
1629181
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will add unnecessary overhead per comparison when sorting on a non-string column. Ideally instead of doing the check inside the loop, we would have two separate specialized loops to keep other types fast. This might be complex to refactor though because there are already multiple optimizations piled in here.
ie. instead of this pattern:
we should do something like this:
I think we should have some benchmarks on sorting integers that can back up my claim, I wonder how much this change costs them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cost of fetching the String ID for a non-mixed/non-string properties is only the cost of checking that the property type may have a string interner associated (check on the column type) + an extra check for the optional type.
Also in the code, we iterate column by column, so the check per col type can only be perform when we are checking the column itself. I can try to run the benchmarks and see..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I think statically it does not make a huge difference, probably we are going to slow down a little bit sorting for all properties that are not strings or mixed, but, it is really difficult to gauge and also doesn't it depend on also by the data we have?
Another run (same ref data as starting cmp input)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a pretty big perf regression to me?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a bug in my code, I was constantly passing by copy the Extended Column, these are the right perfs:
I am still failing to see why this is a massive problem for non-mixed or non-string columns. It does not seem like it is to me, but I must be wrong. Because you are both complaining about this :-)
I agree that Mixed and Strings are going to be impacted, though. Especially Mixed, because a mixed can hold anything in it…
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I tried
BenchmarkSort
already. And it is slower. But at the same time it is not the best benchmark either IMO.Req runs: 5 Sort (MemOnly, EncryptionOff): * min 207.91ms (+72.16%) max 214.60ms (+48.24%) * med 210.27ms (+62.98%) * avg 211.07ms (+60.69%) stddev 2.99ms (-65.96%)
Mainly for these 3 reasons:
The internal DS that the string interner is keeping is not lock free, we are constantly grabbing a lock. Especially for sorting stuff, it is a real problem (since we may be calling the sorting operator a lot) . Jira to cover this: (https://jira.mongodb.org/browse/RCORE-2214)
We are constantly loading a lot of data when we need to access the compressed string leaves in the Trie we built, we need to improve the internal DS: (https://jira.mongodb.org/browse/RCORE-2125)
BenchmarkSort
is writing 200k randomly generated strings which are representing some number. This is very much the worse scenario for interning stuff. Also, in the feature branch we are not limiting compression of strings like we do for integers (where we check if there is a sizeable gain before compressing). We just compress all (I suspect we will introduce something).I've added a benchmark for testing the sorting of strings that are in compressed format (as you suggested). Passing 2 parameters to the test:
You can see that already, with all these limitations, there is a point after that it pays off to sort using the compressed string ID.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the non-string benchmarks, we appear to be less than 10% worse which I think seems reasonable. Ideally, that code would not be affected at all, and there may be a way to do this by templating out the non-string sorting, but I'm not sure it is worth the duplication that will create.
For the BenchmarkSort case, I don't think we should ship this feature until we get that regression down by quite a bit. Can you create a ticket for the third item you suggested so that we don't forget about it? As long as we are tracking those 3 approaches and get them and recheck the benchmarks later, it shouldn't block this PR now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep I agree.
There is a bit of tech debt in the sorting logic. I tried to make the code templated based on column type, but I quit, because the code was basically duplicated, also we need to extract the column type in at least 2 cases. First one when we set the index pair and the other one when we are sorting the index pairs. I concluded it was too much hassle. But maybe there is a better way...
Yes I agree with you for string benchmarks, in fact all the code is not going into next major. We do have already a jira for this (https://jira.mongodb.org/browse/RCORE-2125). I will drop a comment in order to track this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean by this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant: that the code is not finished, we need to work on the optimizations I mentioned. The feature branch is just our base. Once we complete all the features, it will of course go into next major, like we did for integers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is covering very little of the newly added code. It needs to check sorting mixed with all of compressed strings, non-compressed strings, and non-strings, along with sorting over links.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep you are right, but we do have many of these tests already, I will add more tests like this and be sure we are covering all, and not solely rely on our existing tests (if they commit the strings before)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have updated the tests in order to cover what you asked @tgoyne .