-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-34484: [Substrait] add an option to disable augmented fields #41583
Conversation
cpp/src/arrow/dataset/scanner.h
Outdated
@@ -287,10 +290,12 @@ struct ARROW_DS_EXPORT ProjectionDescr { | |||
|
|||
/// \brief Create a default projection referencing fields in the dataset schema | |||
static Result<ProjectionDescr> FromNames(std::vector<std::string> names, | |||
const Schema& dataset_schema); | |||
const Schema& dataset_schema, | |||
bool add_augmented_fields); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we set a default value to ensure we don't break any existing consumers of this function?
cpp/src/arrow/dataset/scanner.h
Outdated
|
||
/// \brief Make a projection that projects every field in the dataset schema | ||
static Result<ProjectionDescr> Default(const Schema& dataset_schema); | ||
static Result<ProjectionDescr> Default(const Schema& dataset_schema, | ||
bool add_augmented_fields); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Another alternative could be to pass through a ScanOptions data structure (still would require a default value though).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general this looks fine to me, just a single nit pick from my end. But i'll leave final approval to someone more familiar with this section of the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, thanks :)
plan->StartProducing(); | ||
ASSERT_FINISHES_OK(plan->finished()); | ||
} | ||
|
||
TEST(Substrait, RelWithHint) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nit: I think we have one or two spots in python where we have to do a column selection to workaround this issue. We can probably remove these now.
e.g. https://github.com/apache/arrow/blob/main/python/pyarrow/tests/test_substrait.py#L93
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit a4a5cf1. There was 1 benchmark result indicating a performance regression:
The full Conbench report has more details. It also includes information about 5 possible false positives for unstable benchmarks that are known to sometimes produce them. |
…apache#41583) ### Rationale for this change Augmented fields interfere with the schema passing between nodes. When enabled they cause names/schema mismatching at the end of the plan. ### What changes are included in this PR? Adds an option to disable augmented fields (defaulting to adding them), connects it everywhere it is called, and disables it in ReadRel conversion. ### Are these changes tested? Yes. ### Are there any user-facing changes? There are no API related changes however this will allow Substrait plans that consume local files to work without requiring a project/emit relation after the read relation to remove the unexpected fields. * GitHub Issue: apache#34484 Authored-by: David Sisson <[email protected]> Signed-off-by: Matt Topol <[email protected]>
…apache#41583) ### Rationale for this change Augmented fields interfere with the schema passing between nodes. When enabled they cause names/schema mismatching at the end of the plan. ### What changes are included in this PR? Adds an option to disable augmented fields (defaulting to adding them), connects it everywhere it is called, and disables it in ReadRel conversion. ### Are these changes tested? Yes. ### Are there any user-facing changes? There are no API related changes however this will allow Substrait plans that consume local files to work without requiring a project/emit relation after the read relation to remove the unexpected fields. * GitHub Issue: apache#34484 Authored-by: David Sisson <[email protected]> Signed-off-by: Matt Topol <[email protected]>
Rationale for this change
Augmented fields interfere with the schema passing between nodes. When enabled they cause names/schema mismatching at the end of the plan.
What changes are included in this PR?
Adds an option to disable augmented fields (defaulting to adding them), connects it everywhere it is called, and disables it in ReadRel conversion.
Are these changes tested?
Yes.
Are there any user-facing changes?
There are no API related changes however this will allow Substrait plans that consume local files to work without requiring a project/emit relation after the read relation to remove the unexpected fields.