[C++][Compute][Dataset] Add dataset::WriteNode for writing rows from an ExecPlan to disk #29194

Closed
2 tasks done
asfimport opened this issue Aug 3, 2021 · 6 comments

asfimport commented Aug 3, 2021

This will serve as a sink ExecNode which dumps all the batches it receives to disk. The PR should probably also replace FileSystemDataset::Write with an ExecPlan-based implementation.

Reporter: Ben Kietzman / @bkietz
Assignee: Weston Pace / @westonpace

Subtasks:

Related issues:

PRs and other links:

Note: This issue was originally created as ARROW-13542. Please see the migration documentation for further details.

Ben Kietzman / @bkietz:
@lidavidm

David Li / @lidavidm:
I can take a look. @westonpace were you already looking/planning to look at this area?

Weston Pace / @westonpace:
Not this week, but if it's still open next week I'll take it. I'm going to assign it to myself, but feel free to steal it if you get to it before me (I'll mark it "In Progress" when I actually start working on it).

Weston Pace / @westonpace:
@bkietz How do you envision partitioning being handled? Would that be a separate node or part of this node?

Ben Kietzman / @bkietz:
Currently I was thinking that partitioning would be handled within this node, since that'd be the most straightforward extraction of a node from FileSystemDataset::Write.

If you wanted to extract a compute::PartitionNode instead, that'd probably be useful later on. I think PartitionNode would:

  • use a Grouper to identify each row's destination partition

  • sort batches by their partition id

  • emit slices of input batches with equal partition id

  • store the partition expression in ExecBatch::guarantee
    (note: this does not use a dataset::Partitioning)

Then WriteNode would only use a Partitioning to format ExecBatch::guarantees to an output directory. I think this approach would allow us to delete Partitioning::Partition too, since that behavior would now be encapsulated by PartitionNode.

Also note that whatever approach you take is going to impinge on ARROW-13338, since ExecPlans don't support sync scanning and FileSystemDataset::Write depends on [[deprecated]] Scanner::Scan.
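
To make the proposed PartitionNode / WriteNode split concrete, here is a rough standard-library-only sketch of the data flow. It does not use the Arrow API: `Batch`, `AssignPartitionIds`, `PartitionBatch`, and `FormatDirectory` are all hypothetical names made up for illustration, the `guarantee` string is only a stand-in for ExecBatch::guarantee, and the `field=value` directory layout is an assumption (hive-style), not something this thread specifies.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for an ExecBatch: one partition-key column, one value column.
struct Batch {
  std::vector<std::string> key;  // partition field values, e.g. "2021"
  std::vector<int64_t> value;    // payload column
  std::string guarantee;         // stand-in for ExecBatch::guarantee, e.g. "year=2021"
};

// Plays the role a Grouper would: assigns a dense id to each distinct key,
// in order of first appearance, and records the distinct keys in `uniques`.
std::vector<std::size_t> AssignPartitionIds(const std::vector<std::string>& keys,
                                            std::vector<std::string>* uniques) {
  std::map<std::string, std::size_t> seen;
  std::vector<std::size_t> ids;
  ids.reserve(keys.size());
  for (const auto& k : keys) {
    auto inserted = seen.emplace(k, uniques->size());
    if (inserted.second) uniques->push_back(k);
    ids.push_back(inserted.first->second);
  }
  return ids;
}

// Sketch of what a PartitionNode would emit for one input batch: one slice per
// partition id, each tagged with its guarantee. Bucketing rows by id gives the
// same result as a stable sort by id followed by slicing at id boundaries.
std::vector<Batch> PartitionBatch(const Batch& in, const std::string& field_name) {
  assert(in.key.size() == in.value.size());
  std::vector<std::string> uniques;
  const std::vector<std::size_t> ids = AssignPartitionIds(in.key, &uniques);

  std::vector<Batch> out(uniques.size());
  for (std::size_t g = 0; g < uniques.size(); ++g) {
    out[g].guarantee = field_name + "=" + uniques[g];
  }
  for (std::size_t row = 0; row < ids.size(); ++row) {
    out[ids[row]].key.push_back(in.key[row]);
    out[ids[row]].value.push_back(in.value[row]);
  }
  return out;
}

// Sketch of the only partitioning-related work left for a WriteNode: turning a
// slice's guarantee into an output directory (hive-style layout assumed here).
std::string FormatDirectory(const std::string& base_dir, const Batch& slice) {
  return base_dir + "/" + slice.guarantee;
}
```

In this sketch, calling `PartitionBatch` on a batch whose key column holds "2021", "2020", "2021" with field name "year" yields two slices, and `FormatDirectory` maps each one to its own directory under the base path. The point of the split is that the writer never has to look at the key column itself, only at the guarantee attached to each slice.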

Weston Pace / @westonpace:
Issue resolved by pull request 11017
#11017
