Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve recursive unnest options API #12836

Merged
merged 14 commits into from
Oct 20, 2024
Merged

Conversation

duongcongtoai
Copy link
Contributor

@duongcongtoai duongcongtoai commented Oct 9, 2024

Which issue does this PR close?

Cleanup #11577

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added sql SQL Planner logical-expr Logical plan and expressions common Related to common crate labels Oct 9, 2024
unnest_with_options(input, unnestings, UnnestOptions::default())
}

pub fn get_unnested_list_datatype_recursive(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove this method, recursion specification can be declared using UnnestOptions

/// will only be done once (depth = 1). In case recursion is needed on a multi-dimensional
/// list type, use [`ColumnUnnestList`]
#[derive(Debug, Clone, PartialEq, Eq, Hash, PartialOrd)]
pub enum ColumnUnnestType {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This struct is now a part of the builder, no need to expose to api

@github-actions github-actions bot added the physical-expr Physical Expressions label Oct 17, 2024
@duongcongtoai duongcongtoai marked this pull request as ready for review October 17, 2024 19:35
@duongcongtoai duongcongtoai changed the title Cleanup TODO in recursive unnest Cleanup TODO in recursive unnest Oct 17, 2024
@github-actions github-actions bot added the proto Related to proto crate label Oct 17, 2024
@alamb
Copy link
Contributor

alamb commented Oct 18, 2024

Merging up from main to get CI to pass

@alamb alamb added the api change Changes the API exposed to users of the crate label Oct 18, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @duongcongtoai -- I think this is a significant improvement in code cleaniless

I think it would be nice to add some comments, as I have described below, but I also don't think it is strictly necessary and could be done after merge

@@ -64,13 +66,24 @@
pub struct UnnestOptions {
/// Should nulls in the input be preserved? Defaults to true
pub preserve_nulls: bool,
/// Without explicit recursions, all
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please document what the Vec means in this context? Is the option for each of the columns in the schema? If so can you say it is expected to have the same number of elements?

If possible it would also be great to update the documentation on this struct to explain what is going on with recursive options

}

#[derive(Debug, Clone, PartialEq, Eq, Hash, PartialOrd)]
pub struct RecursionUnnestOption {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we please add documentation to this structure as it is public?

/// if one column is a list type, it can be recursively and simultaneously
/// unnested into the desired recursion levels
/// e.g select unnest(list_col,depth=1), unnest(list_col,depth=2)
pub fn unnest_columns_recursive_with_options(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Comment on lines 2243 to 2254
UnnestOptions::default().with_recursions(vec![
RecursionUnnestOption {
input_column: "stringss".into(),
output_column: "stringss_depth_1".into(),
depth: 1,
},
RecursionUnnestOption {
input_column: "stringss".into(),
output_column: "stringss_depth_2".into(),
depth: 2,
},
]),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is another potential way for this API to look would be to take individual options rather than a single Vec

I don't feel strongly about this

Suggested change
UnnestOptions::default().with_recursions(vec![
RecursionUnnestOption {
input_column: "stringss".into(),
output_column: "stringss_depth_1".into(),
depth: 1,
},
RecursionUnnestOption {
input_column: "stringss".into(),
output_column: "stringss_depth_2".into(),
depth: 2,
},
]),
UnnestOptions::new()
.with_recursion(
RecursionUnnestOption {
input_column: "stringss".into(),
output_column: "stringss_depth_1".into(),
depth: 1,
})
.with_recusion(
RecursionUnnestOption {
input_column: "stringss".into(),
output_column: "stringss_depth_2".into(),
depth: 2,
},
),

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you, i refactored so that the recursion is optional at column level, and since None can be replaced by an empty vector, i simplified the type of this field from Option to Vec

"| | | e |",
"+---------------------------------+---------------------------------+---------------------------------+",
];
assert_batches_eq!(expected, &[ret]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -301,7 +301,7 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
// The transformation happen bottom up, one at a time for each iteration
// Only exhaust the loop if no more unnest transformation is found
for i in 0.. {
let mut unnest_columns = vec![];
let mut unnest_columns = HashMap::new();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be better to use an IndexMap here (to retain the same order as was specified in the query) -- a HashMap can mix up the order

Comment on lines 467 to 478
match maybe_list_unnest {
None => {}
Some(list_unnest) => {
list_recursions.extend(list_unnest.into_iter().map(
|unnest_list| RecursionUnnestOption {
input_column: col.clone(),
output_column: unnest_list.output_column,
depth: unnest_list.depth,
},
));
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could simplify this code (reduce recursion level) with if let like

Suggested change
match maybe_list_unnest {
None => {}
Some(list_unnest) => {
list_recursions.extend(list_unnest.into_iter().map(
|unnest_list| RecursionUnnestOption {
input_column: col.clone(),
output_column: unnest_list.output_column,
depth: unnest_list.depth,
},
));
}
}
if let Some(list_unnest) = maybe_list_unnest {
list_recursions.extend(list_unnest.into_iter().map(
|unnest_list| RecursionUnnestOption {
input_column: col.clone(),
output_column: unnest_list.output_column,
depth: unnest_list.depth,
},
));
}
}

Also I wonder if it would be possible / easier for rewrite_recursive_unnests_bottom_up to simply return a HashSet<Column, RecursionUnnestOption> directly rather than having to convert the results

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true, i can refactor this function so it returns 3 result:

  • inner_projection_exprs
  • unnest_placeholder_columns
  • outer_projection_exprs

@alamb alamb changed the title Cleanup TODO in recursive unnest Improve recursive unnest options API Oct 20, 2024
@alamb
Copy link
Contributor

alamb commented Oct 20, 2024

Looks great -- thanks again @duongcongtoai

@alamb alamb merged commit c7e5d8d into apache:main Oct 20, 2024
27 checks passed
Michael-J-Ward added a commit to Michael-J-Ward/datafusion-python that referenced this pull request Oct 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate common Related to common crate logical-expr Logical plan and expressions physical-expr Physical Expressions proto Related to proto crate sql SQL Planner
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants