Reduce memory usage of TypedSet #4123

Merged
merged 2 commits into from
Jul 30, 2020

Conversation

raunaqmorarka
Member

No description provided.

@cla-bot cla-bot bot added the cla-signed label Jun 21, 2020
@raunaqmorarka raunaqmorarka requested a review from dain June 22, 2020 13:41
Member

@Lewuathe Lewuathe left a comment

Do you have any microbenchmark result measuring how much memory usage we can reduce by this change?

Member

@sopel39 sopel39 left a comment

How does this PR reduce memory usage in TypedSet?

presto-main/src/main/java/io/prestosql/type/TypeUtils.java (outdated)
else if (type instanceof VarcharType) {
// If bound on length of varchar is smaller than defaultSize, use that as expected size
return ((VarcharType) type).getLength()
.map(length -> Math.min(length, defaultSize))
Member

Varchars usually won't occupy full declared length.

Member

varchar length is a character count, and varchar is encoded in UTF-8, which means up to 4 bytes per character. If this is just an expected size, assuming ASCII is reasonable, but we should note that in a comment here.

Member Author

Added a comment for UTF-8

Member

I still don't think that assuming a varchar will occupy its entire declared length is correct. It seems very pessimistic. What do you think, @dain?

Member

@sopel39 that's why we do min here, right?

Member

I'm not sure what the contract of defaultSize is here. Could it be excessively large?

Member Author

In current usage, defaultSize is 16, 32 or 100 depending on where it's called from. This change should just reduce the expected size estimate when a smaller bound is known (e.g. varchar(10)). This would help reduce memory usage in cases where a large number of TypedSets are generated but each set has a small number of entries that are only a few characters long.
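
For readers following the thread, here is a minimal sketch of the estimate described above. It is not the code from this PR: the class and method names are hypothetical, and an `OptionalInt` stands in for the declared varchar length so the example runs without the project's `Type` classes.

```java
import java.util.OptionalInt;

// Hypothetical stand-in for the expected-size logic discussed in this thread.
// It does not use the real Type or VarcharType classes.
final class ExpectedSizeSketch
{
    private ExpectedSizeSketch() {}

    // boundedLength: declared varchar length in characters (empty for unbounded varchar).
    // defaultSize: the caller's fallback estimate in bytes (16, 32 or 100 per the comment above).
    static int expectedBytesPerEntry(OptionalInt boundedLength, int defaultSize)
    {
        if (boundedLength.isPresent()) {
            // Assumes ASCII data, i.e. one byte per character, so the declared
            // length can be compared directly with the byte-based default.
            return Math.min(boundedLength.getAsInt(), defaultSize);
        }
        return defaultSize;
    }

    public static void main(String[] args)
    {
        System.out.println(expectedBytesPerEntry(OptionalInt.of(10), 32));  // varchar(10): prints 10
        System.out.println(expectedBytesPerEntry(OptionalInt.of(500), 32)); // varchar(500): prints 32
        System.out.println(expectedBytesPerEntry(OptionalInt.empty(), 32)); // unbounded: prints 32
    }
}
```

The `min` means the estimate can only shrink relative to `defaultSize`, which is the point raised in the reply about why `min` is used.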

Member

I think we should assume the default size is one byte. I would update the comment a bit:

It can take up to 4 bytes per character due to UTF-8 encoding, but we assume the data is ASCII and only needs one byte.
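
To illustrate the gap between character count and encoded size mentioned here, the following standalone sketch compares the two for a few sample strings. It uses only the JDK, the class and method names are made up for this example, and treating a "character" as a Unicode code point is an assumption of the sketch.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: shows why a length in characters is a lower bound
// on the UTF-8 encoded size, and equal to it for pure ASCII data.
final class Utf8LengthSketch
{
    private Utf8LengthSketch() {}

    // Number of Unicode code points in the string (assumed meaning of "character" here).
    static int characterCount(String value)
    {
        return value.codePointCount(0, value.length());
    }

    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8ByteLength(String value)
    {
        return value.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args)
    {
        // ASCII text, U+00E9 (e with acute), U+20AC (euro sign), U+1F600 (emoji as a surrogate pair)
        String[] samples = {"abc", "\u00E9", "\u20AC", "\uD83D\uDE00"};
        for (String sample : samples) {
            System.out.println(characterCount(sample) + " character(s), " + utf8ByteLength(sample) + " byte(s)");
        }
    }
}
```

The samples encode to 3, 2, 3 and 4 bytes respectively, so a one-byte-per-character estimate is exact for ASCII and an underestimate otherwise.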

Member Author

Updated the comments in code

dain previously requested changes Jul 13, 2020
Member

@dain dain left a comment

I agree with @sopel39's comments. I don't think we need the first commit that changes the fastutil IntArrayList to an int[]. The second commit looks good, but needs a fix for the CHAR branch.

Member

@findepi findepi left a comment

(just skimming)

Member

@sopel39 sopel39 left a comment

LGTM on "Use getInt to access IntArrayList". I suggest making two separate PRs for the two commits.

Member

@sopel39 sopel39 left a comment

@dain do you want to take a look?

Member

@dain dain left a comment

Looks good to me.

@raunaqmorarka raunaqmorarka requested a review from dain July 30, 2020 08:49
@sopel39 sopel39 dismissed dain’s stale review July 30, 2020 09:07

Dain gave lgtm

@sopel39 sopel39 merged commit 547eb1a into trinodb:master Jul 30, 2020
Member

sopel39 commented Jul 30, 2020

merged, thanks!

@sopel39 sopel39 mentioned this pull request Jul 30, 2020
8 tasks
@raunaqmorarka raunaqmorarka deleted the typed_set_opt branch January 14, 2021 11:38
5 participants