Use segmented Slice in SliceDictionaryWriter #15956

Merged: 1 commit, Apr 26, 2021

@arunthirupathi arunthirupathi commented Apr 17, 2021

Store dictionary elements in segmented Slices instead of one
contiguous segment. When the dictionary holds fewer than 100,000
elements, there is no noticeable performance degradation. When it
reaches 10,000,000 elements, sorting and comparing elements must
compute the segment and offset of each element, which degrades
performance by 10%. This is an unlikely case.
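
The extra cost described above comes from translating a global element position into a (segment, offset) pair on every access. The following is a minimal sketch of that translation, assuming a power-of-two segment size. The class name, the constants, and their values are hypothetical and are not taken from this PR; the actual `SegmentHelper` may compute these differently.

```java
// Hypothetical sketch: maps a global element position to (segment, offset).
// SEGMENT_SHIFT and SEGMENT_SIZE are assumed values, not taken from the PR.
final class SegmentIndexSketch
{
    static final int SEGMENT_SHIFT = 4;                  // assumed: 16 elements per segment
    static final int SEGMENT_SIZE = 1 << SEGMENT_SHIFT;
    static final int SEGMENT_MASK = SEGMENT_SIZE - 1;

    private SegmentIndexSketch() {}

    // Index of the segment that holds the element at the given global position.
    static int segmentOf(int position)
    {
        return position >>> SEGMENT_SHIFT;
    }

    // Offset of the element inside its segment.
    static int offsetOf(int position)
    {
        return position & SEGMENT_MASK;
    }

    // Contiguous layout: a single array lookup.
    static int getContiguous(int[] values, int position)
    {
        return values[position];
    }

    // Segmented layout: a shift, a mask, and a second array dereference per lookup.
    static int getSegmented(int[][] segments, int position)
    {
        return segments[segmentOf(position)][offsetOf(position)];
    }

    public static void main(String[] args)
    {
        int[][] segments = new int[3][SEGMENT_SIZE];
        segments[2][5] = 42;
        int position = 2 * SEGMENT_SIZE + 5;
        System.out.println(getSegmented(segments, position));
    }
}
```

A sort over millions of elements performs this translation for both operands of every comparison, which is one plausible reason the overhead only becomes visible for very large dictionaries.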

Test plan -
Added new test cases for the SegmentedSlices.
Dictionary is covered by existing tests.

== RELEASE NOTES ==

General Changes
* Store dictionary elements in Segmented Slice.

@arunthirupathi arunthirupathi marked this pull request as draft April 17, 2021 07:22
@arunthirupathi arunthirupathi changed the title from "Segmented Slice builder" to "Use segmented Slice in SliceDictionaryWriter" Apr 17, 2021
@arunthirupathi arunthirupathi marked this pull request as ready for review April 17, 2021 18:43
arunthirupathi commented Apr 18, 2021

Here is the performance comparison before and after this change. "Base" is before this change; "Segmented" is with this change. Note that the Direct path is unchanged between Base and Segmented, so both Direct columns should be 100%; the results are close enough.

MIN of Score Benchmark

| (typeSignature) | (uniqueValuesPercentage) | Direct.Segmented | Direct.Base | DictionaryToDirect.Segmented | DictionaryToDirect.Base | Dictionary.Segmented | Dictionary.Base |
|---|---|---|---|---|---|---|---|
| varchar | 1 | 100.50% | 100.00% | 316.04% | 314.30% | 288.96% | 280.23% |
| | 5 | 100.00% | 113.50% | 424.47% | 371.08% | 432.08% | 428.40% |
| | 10 | 100.00% | 105.99% | 537.32% | 518.47% | 702.22% | 711.15% |
| | 100 | 100.00% | 106.36% | 1169.77% | 1293.29% | 5512.29% | 4659.38% |

@highker highker left a comment

mostly nits

Comment on lines 294 to 291
public BlockBuilder newBlockBuilderLike(BlockBuilderStatus blockBuilderStatus, int expectedEntries)
{
    if (blockBuilderStatus != null) {
        throw new UnsupportedOperationException("Not yet implemented");
    }

@highker (Contributor):

blockBuilderStatus is actually fairly important. QQ: Is newBlockBuilderLike used anywhere in the orc package? If not, just throw?

@arunthirupathi (Author):

This blockBuilder is only used by the SliceDictionaryBuilder and it passes in null for the blockBuilderStatus.

https://github.com/prestodb/presto/blob/master/presto-orc/src/main/java/com/facebook/presto/orc/writer/SliceDictionaryBuilder.java#L51


private final DynamicSliceOutput openSliceOutput;

private int openSegment;

@highker (Contributor):

s/openSegment/openSegmentIndex, if I read the code correctly

@arunthirupathi (Author):

You are right, renamed it.

offsets[openSegment][openSegmentOffset] = openSliceOutput.size();
if (openSegmentOffset == SegmentHelper.SEGMENT_SIZE) {
    // Add the current finalized slice to closedSlices
    Slice slice = openSliceOutput.copySlice();

@highker (Contributor):

Do we copy here to save space? We usually avoid calling this method because of the heavy GC it causes. Maybe slice() is good enough.

@arunthirupathi (Author):

slice() returns a view of the underlying buffer. Once the segment is full, the bytes are copied with copySlice(), and the DynamicSliceOutput is reset and reused for the new segment.
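
The view-versus-copy distinction in this reply can be illustrated with a small JDK-only analogy. This sketch does not use the airlift Slice API: `ByteBuffer.wrap` stands in for a view over a reusable buffer and `Arrays.copyOf` stands in for a copy. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// JDK-only analogy; it does not use the airlift Slice API.
final class ViewVersusCopySketch
{
    private ViewVersusCopySketch() {}

    // View: shares the backing array, so later writes to the buffer show through.
    static ByteBuffer view(byte[] buffer, int length)
    {
        return ByteBuffer.wrap(buffer, 0, length);
    }

    // Copy: owns its bytes, so the buffer can be reused safely afterwards.
    static byte[] copy(byte[] buffer, int length)
    {
        return Arrays.copyOf(buffer, length);
    }

    public static void main(String[] args)
    {
        byte[] reusable = "abc".getBytes(StandardCharsets.US_ASCII);
        ByteBuffer closedAsView = view(reusable, 3);
        byte[] closedAsCopy = copy(reusable, 3);

        // Simulate resetting the buffer and writing the next segment into it.
        reusable[0] = 'x';

        System.out.println((char) closedAsView.get(0)); // reflects the overwrite
        System.out.println((char) closedAsCopy[0]);     // still the original byte
    }
}
```

If a closed segment kept only a view, reusing the buffer for the next segment would overwrite the data the closed segment refers to, so a copy is needed whenever the buffer is reused.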

@arunthirupathi arunthirupathi force-pushed the segmented_slice_builder_2 branch 2 times, most recently from 21cd7b1 to 9bf70ba Compare April 23, 2021 23:17
@highker highker left a comment

nits only

@highker highker self-assigned this Apr 25, 2021