-
Notifications
You must be signed in to change notification settings - Fork 234
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Increase row limit when doing count() for HostColumnarToGpu #1868
Conversation
Signed-off-by: Thomas Graves <[email protected]>
Signed-off-by: Thomas Graves <[email protected]>
build |
_assert_gpu_and_cpu_are_equal(func, True, conf=conf) | ||
_assert_gpu_and_cpu_are_equal(func, 'COLLECT', conf=conf) | ||
|
||
def assert_gpu_and_cpu_are_equal_count(func, conf={}): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not a huge fan of this API. It makes it really easy to accidentally only compare the number of rows for a query instead of the actual data of the query. I understand why you are trying to do this, but the same thing can be achieved with assert_gpu_and_cpu_are_equal_collect
and inserting a groupBy().count()
in there.
This is not a show stopper. If you want to keep it, that is fine, I would just like a warning in the doc string about this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, I thought the _count would imply that you are only comparing counts. We in general have no tests that test count() that I see and we have hit lots of issues with it so I thought it would be a convenient api to easily add tests without having to change the operation or a bunch of code.
I'm definitely open to changing if we think it will cause problems but I don't think groupBy().count() is as intuitive or convenient.
Perhaps just changing the name to assert_gpu_and_cpu_row_counts_equal and add more documentation? Or if you can think of another convenience api.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is fine with me. I mostly want to be sure that no one uses it on accident. Using it on purpose is fine.
@@ -279,6 +279,11 @@ class HostToGpuCoalesceIterator(iter: Iterator[ColumnarBatch], | |||
// schema and desired batch size | |||
batchRowLimit = GpuBatchUtils.estimateRowCount(goal.targetSizeBytes, | |||
GpuBatchUtils.estimateGpuMemory(schema, 512), 512) | |||
// when there aren't any columns, it generally means user is doing a count() and we don't |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice find.
Let us not call estimateRowCount/estimateGpuMemory
when there no columns:
batchRowLimit = if (batch.numCols() > 0 ) {
GpuBatchUtils.estimateRowCount(goal.targetSizeBytes,
GpuBatchUtils.estimateGpuMemory(schema, 512), 512)
} else {
Integer.MAX_VALUE
}
build |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, 🚀
) * Increase row limit when doing count() for HostColumnarToGpu Signed-off-by: Thomas Graves <[email protected]> * put test back in * comment * update test comment * update comment Signed-off-by: Thomas Graves <[email protected]> * Update count tests function, fix missed calls, and review comments
) * Increase row limit when doing count() for HostColumnarToGpu Signed-off-by: Thomas Graves <[email protected]> * put test back in * comment * update test comment * update comment Signed-off-by: Thomas Graves <[email protected]> * Update count tests function, fix missed calls, and review comments
If doing a count() with HostcolumnToGPU which ends up doing a coalesce, the row limit is currently set at 512. This can be very inefficient for larger data. So when there aren't any columns specified just set the row limit to max integer.
This improved one query from 2 minutes down to 6 seconds.
I also added a new testing function assert_gpu_and_cpu_are_equal_count to be able to easily test count() calls.
fixes #1864