Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Better support for CollectLimitExec #7005

Open
firestarman opened this issue Nov 4, 2022 · 3 comments
Open

[FEA] Better support for CollectLimitExec #7005

firestarman opened this issue Nov 4, 2022 · 3 comments
Labels
feature request New feature or request improve P2 Not required for release

Comments

@firestarman
Copy link
Collaborator

firestarman commented Nov 4, 2022

The current GPU implementation of CollectLimitExec (leveraging the GpuGlobalLimitExec) does not cover the executeCollect path, causing some small issues.
See #882 and #6814.

We may need to support this executeCollect path same as the CPU CollectLimitExec. Besides, we should also take care of the possible plan tree change when enabling the GPU CollectLimitExec, to make sure it will go into the executeCollect path for cases like take, show and limit. Since the root node (aka the last operation) of the plan will be probably GpuColumnarToRowExec, not the CollectLimitExec any more.

@firestarman firestarman added feature request New feature or request P2 Not required for release improve labels Nov 4, 2022
@revans2 revans2 added the ? - Needs Triage Need team to review and classify label Nov 4, 2022
@revans2
Copy link
Collaborator

revans2 commented Nov 4, 2022

Those are the reasons that we don't replace CollectLimitExec by default. It is difficult to actually make this work given our current rules (but not impossible). From a performance standpoint we might just want to insert a GpuLocalLimitExec ahead of the CollectLimitExec. We are going to have to pull back N rows to the CPU anyways for the executeCollect to work. In the common case only a single task is going to run with a CollectLimitExec and a reasonable number of rows, so why not just limit those rows before we copy it to the CPU?

@firestarman
Copy link
Collaborator Author

firestarman commented Nov 8, 2022

In the common case only a single task is going to run with a CollectLimitExec and a reasonable number of rows, so why not just limit those rows before we copy it to the CPU?

Hi, not fully get what you mean for the last suggestion.

If CollectLimitExec is always or commonly used as the last operation, how about marking it not support columnar processing, and implement the execute and executeCollect, where it calls the child's executeColumnar and fetch only the limit rows, similar to what GpuColumnarToRowExec does.

Besides, add a rule to make sure a ColumnarToRowExec will not be inserted before the CollectLimitExec . Then cases like show, take, head will run into executeCollect path, where it avoids the shuffle in execute.

Of course there should be many details I miss here, but this is an overall idea.

@revans2
Copy link
Collaborator

revans2 commented Nov 8, 2022

If CollectLimitExec is always or commonly used as the last operation, how about mark it not support columnar processing, and implement the execute and executeCollect, where it calls the child's executeColumnar and fetch only the limit rows, similar to what GpuColumnarToRowExec does.

That is essentially what I am suggesting, but hopefully with less code changes.

What we have today.

ColumnarProcessing ... -> ColumnarToRow -> CollectLimitExec

This works, but ColumnarToRow is going to move at least one full batch of data from the GPU to the CPU so that CollectLimitExec can throw away everything but the first N rows.

You are proposing to fix this by having a GpuCollectLimitExec that would do three steps.

  1. Limit the input data using the GPU
  2. Copy the columnar data to the CPU and convert it into rows
  3. Return the row based data using the executeCollect API
ColumnarProcessing ... -> |            GpuCollectLimitExec            |
                          | Limit -> ColumnarToRow -> executeCollect  |

This is fine and works, but it means that we have to mark GpuCollectLimitExec as doing the columnar to row transition so the planning rules handle it properly and don't try to insert in extra transitions.

What I am proposing is to use existing Execs to do the same thing.

ColumnarProcessing ... -> GpuLocalLimitExec -> GpuColumnarToRowExec -> CollectLimitExec

@firestarman your solution would end up being cleaner on the UI, which is a big advantage for end users. So if you are willing to put in the effort to make that happen I would love to see it. I was just thinking of how we could do it with as little work as possible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request improve P2 Not required for release
Projects
None yet
Development

No branches or pull requests

2 participants