-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reduce memory usage of TypedSet #4123
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have any microbenchmark result measuring how much memory usage we can reduce by this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this PR reduce memory usage in TypesSet
?
presto-main/src/main/java/io/prestosql/operator/aggregation/TypedSet.java
Outdated
Show resolved
Hide resolved
else if (type instanceof VarcharType) { | ||
// If bound on length of varchar is smaller than defaultSize, use that as expected size | ||
return ((VarcharType) type).getLength() | ||
.map(length -> Math.min(length, defaultSize)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Varchars usually won't occupy full declared length.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
varchar
length is "character count", andvarchar
is encoded in UTF-8 which means up to 4 bytes per character. If this is just an expected size, assuming ascii is reasoanble, but we should note that in a comment here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a comment for UTF-8
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still don't think that assuming that varchar will occupy entire length is the correct one. It seems very pessimistic. What do you think @dain?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sopel39 that's why we do min
here, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what contract of defaultSize
is here. Could it be excessively large?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In current usage defaultSize
is 16, 32 or 100 depending on where it's getting called from. This change should just reduce the expected size estimate when a smaller bound is known (E.g. varchar(10)). This would help reduce memory usage in cases where a large no. of TypeSet are generated but each set has small no. of entries which are a few characters long.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should assume the default size is one byte. I would update the comment a bit:
It can take up to 4 bytes per character due to UTF-8 encoding, but we assume the data is ASCII and only needs one byte.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the comments in code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @sopel39's comments. I don't think we need the first commit that changes the fast utils IntArrayList
to an int[]
. The second commit looks good, but needs a fix for the CHAR
branch
10ab1b3
to
ddbf3ed
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(just skimming)
ddbf3ed
to
937403b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM on Use getInt to access IntArrayList
. I suggest making two separate PRs for the two commits
else if (type instanceof VarcharType) { | ||
// If bound on length of varchar is smaller than defaultSize, use that as expected size | ||
return ((VarcharType) type).getLength() | ||
.map(length -> Math.min(length, defaultSize)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still don't think that assuming that varchar will occupy entire length is the correct one. It seems very pessimistic. What do you think @dain?
presto-spi/src/main/java/io/prestosql/spi/type/VarcharType.java
Outdated
Show resolved
Hide resolved
presto-spi/src/main/java/io/prestosql/spi/type/VarcharType.java
Outdated
Show resolved
Hide resolved
937403b
to
76cd2f6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dain do you want to take a look?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me.
else if (type instanceof VarcharType) { | ||
// If bound on length of varchar is smaller than defaultSize, use that as expected size | ||
return ((VarcharType) type).getLength() | ||
.map(length -> Math.min(length, defaultSize)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should assume the default size is one byte. I would update the comment a bit:
It can take up to 4 bytes per character due to UTF-8 encoding, but we assume the data is ASCII and only needs one byte.
76cd2f6
to
3fa8534
Compare
merged, thanks! |
No description provided.