error on value counter overflow instead of writing sad segments #9559
public static String formatMessage(String columnName)
{
  return StringUtils.format(
      "Too many values to store for %s column, try reducing maxRowsPerSegment",
+1 for this operator friendly error message
Can this happen?
@@ -29,5 +29,6 @@
public interface ColumnarDoublesSerializer extends Serializer
{
  void open() throws IOException;
  int size();
Javadocs please. What does the size indicate? The number of rows in the column, or the space it's taking up? Reading the code, it looks like it's the former.
Aw, I was just making it symmetrical with ColumnarFloatsSerializer and ColumnarLongsSerializer, neither of which has any docs either.
This area of the code is sort of barren of javadocs in general, so how about a bargain: unless there is something else to change on this PR, I'll do a follow-up that adds javadocs to a bunch of this stuff?
Sounds like a good deal to me. I'm almost done reviewing; I pressed the wrong button when I made that comment. I don't have anything else yet.
It shouldn't happen, but nothing in the serializer was ensuring this, so it is a defensive mechanism since this part of the code is rather far away from where
VSizeColumnarInts#get does this calculation:
@Override
public int get(int index)
{
  return buffer.getInt(buffer.position() + (index * numBytes)) >>> bitsToShift;
}
^ For a large index, index * numBytes can overflow. I haven't looked through all types of ColumnarInts yet, but is there a way to test that they can read from segments with very large offsets?
Maybe generate some with the tests you've already written, and use those to read large offsets?
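To make the concern about index * numBytes concrete, here is a minimal standalone sketch of the arithmetic only. The class and method names are hypothetical and are not part of the code under review; it simply contrasts plain int multiplication with a widened and a checked variant.

```java
public final class OffsetOverflowSketch
{
  // Same shape as the int arithmetic in the snippet above: the product
  // silently wraps around once it exceeds Integer.MAX_VALUE.
  public static int intOffset(int index, int numBytes)
  {
    return index * numBytes;
  }

  // Widening one operand to long before multiplying preserves the true product.
  public static long longOffset(int index, int numBytes)
  {
    return (long) index * numBytes;
  }

  // Math.multiplyExact throws ArithmeticException instead of wrapping.
  public static int checkedOffset(int index, int numBytes)
  {
    return Math.multiplyExact(index, numBytes);
  }

  public static void main(String[] args)
  {
    int index = 1_000_000_000;
    int numBytes = 3;
    System.out.println(intOffset(index, numBytes));  // wrapped, negative
    System.out.println(longOffset(index, numBytes)); // true product
  }
}
```

Since ByteBuffer#getInt takes an int position, widening alone would not make such a read work; the sketch is only meant to show how the wrap-around could be detected rather than silently producing a negative offset.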
Ah, I forgot to mention in the description that I didn't update
@clintropolis Makes sense. I'm just going to read through all the implementations now...
LGTM thanks for the explanations!
LGTM
[Backport] error on value counter overflow instead of writing sad segments (apache#9559)
Description
This PR fixes an integer overflow issue that occurs if too many values are written to a column within a single segment, preferring to fail at segment creation time instead of writing a sad segment with negative values.
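As a rough illustration of failing at write time, the sketch below guards a value counter before it can wrap. Everything here is hypothetical: ValueCounterSketch, CapacityExceededException, and the maxValues parameter are illustrative stand-ins and not the classes changed in this PR; only the message text is taken from the formatMessage snippet in the review.

```java
public final class ValueCounterSketch
{
  // Hypothetical stand-in for an operator-friendly capacity exception.
  public static class CapacityExceededException extends RuntimeException
  {
    public CapacityExceededException(String columnName)
    {
      super(String.format(
          "Too many values to store for %s column, try reducing maxRowsPerSegment",
          columnName
      ));
    }
  }

  private final String columnName;
  private final int maxValues; // assumed limit; Integer.MAX_VALUE in the overflow case
  private int numInserted = 0;

  public ValueCounterSketch(String columnName, int maxValues)
  {
    this.columnName = columnName;
    this.maxValues = maxValues;
  }

  // Check before incrementing so the counter can never wrap to a negative value.
  public void add()
  {
    if (numInserted == maxValues) {
      throw new CapacityExceededException(columnName);
    }
    numInserted++;
  }

  public int size()
  {
    return numInserted;
  }
}
```

The check runs before the increment, so with maxValues set to Integer.MAX_VALUE the counter stops at the limit instead of wrapping to Integer.MIN_VALUE.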
This issue was first noticed with multi-value columns, exploding at query time with an error of the form:
but would also occur for any serializer given more than Integer.MAX_VALUE rows as input. tl;dr too many values were written to a single segment, so the 'offsets' portion of the multi-value column overflowed into negative numbers.
To fix, primitive column serializers now check the number of values (row count in most cases, total number of values for the case of multi-value strings) to ensure that it does not extend beyond the values that can be expressed in the column header and won't cause any issues at query time. A new exception, ColumnCapacityExceededException, has been added which gives an error message of the form "Too many values to store for %s column, try reducing maxRowsPerSegment", where %s is the column name (which all the serializers now know).
I added a bunch of tests to confirm that this works, and also marked them all @Ignore because they take forever to run. The same IAE error can be replicated by running V3CompressedVSizeColumnarMultiIntsSerializerTest.testTooManyValues without the modifications that check whether overflow has occurred.
I also added a CompressedDoublesSerdeTest that copies CompressedFloatsSerdeTest, since I noticed there wasn't a test for double columns.
Finally, I ran into an issue with IntermediateColumnarLongsSerializer that made it so that I could not test the case when you write too many values to the column: it must store the entire column on heap while it determines the best encoding, so my attempts to run the test were met with an OOM exception. This should probably be fixed, or we should advise against using 'auto' encoding for larger segments, but I did neither in this PR.
This PR has:
Key changed/added classes in this PR