Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement unnest function #6555

Closed
izveigor opened this issue Jun 6, 2023 · 9 comments · Fixed by #10044
Closed

Implement unnest function #6555

izveigor opened this issue Jun 6, 2023 · 9 comments · Fixed by #10044
Assignees
Labels
enhancement New feature or request

Comments

@izveigor
Copy link
Contributor

izveigor commented Jun 6, 2023

Is your feature request related to a problem or challenge?

Follow on to #6384
It would be nice to implement unnest function (with the properties like the analog in PostgreSQL) in arrow-datafusion.

Describe the solution you'd like

Main benefits for adding this feature:

  1. With unnest function we can use aggregate functions for arrays:
SELECT sum(a) AS total FROM (SELECT unnest(make_array(3, 5, 6) AS a) AS b;
----
14
  1. unnest function serves as an exchange between arrays and columns, we have 2 cases of behavior:
  • unnest with single argument
  • unnest with multiple argument (more than 1) (this form is only allowed in a query's FROM clause)

Examples:

unnest(make_array(1, 2))
----
1
2
select * from unnest(make_array(1, 2, 3), make_array('h', 'e', 'l', 'l', 'o')
----
1 h
2 e
3 l
  l
  o

Describe alternatives you've considered

For aggregate functions, we can create a lot of individual functions for aggregate functions (like array_sum), but I think this implementation would be too redundant.

Additional context

Similar Issues:
#6119

Similar PR:
#6384
#5106

Links to sources:
https://www.postgresql.org/docs/current/functions-array.html

@parkma99
Copy link
Contributor

parkma99 commented Jun 9, 2023

I'd like to work on this . Is anyone already preparing a PR?
I find it's hard to me.🥲

@jayzhan211
Copy link
Contributor

jayzhan211 commented Jun 17, 2023

The return type of Unnest might be Vec<ArrayRef> or Vec<ColumnValue>, a set of rows (Vec) with an array of columns (ArrayRef). https://github.com/apache/arrow-datafusion/blob/9b419b19a66bdd35e9e5c0bca259786f8f3c3965/datafusion/expr/src/columnar_value.rs#L32-L38

Maybe we can add another type like RowsOfArray(Vec<ArrayRef>) for the rows-based return type? Are there other alternative return types of Unnest?

I think the data layout is quite similar to ValuesExec, so we might need Vec of some things as the sets of rows.
https://github.com/apache/arrow-datafusion/blob/9b419b19a66bdd35e9e5c0bca259786f8f3c3965/datafusion/core/src/physical_plan/values.rs#L46-L88

@jayzhan211
Copy link
Contributor

I found that we can apply LogicalPlan::Unnest instead of implementing Unnest in ScalarFunction.

@izveigor
Copy link
Contributor Author

Yes @jayzhan211, I know about this feature: (See PR: #5106).
It only works in Rust implementation (the function unnest_columns) but it doesn't in SQL. I only found the expression InUnnest in sqlparser-rs: https://docs.rs/sqlparser/latest/sqlparser/ast/enum.Expr.html#variant.InUnnest.
Therefore, I wonder how we will recognize this token 🤔

Also If we want to stick to the standard implementation of the function (MySQL, PostgreSQL and other),I think we should implement the scalar function as well.

@jackwener @alamb @tustvold What do you think about it?

@alamb
Copy link
Contributor

alamb commented Jun 21, 2023

Also If we want to stick to the standard implementation of the function (MySQL, PostgreSQL and other),I think we should implement the scalar function as well.

I don't understand what a scalar function would do -- isn't unnest effectively a table function (in the sense that it outputs a "table" (in DataFusion terms, a stream of RecordBatches)

Thus I wonder if the SQL planner (or maybe some optimizer pass) could replace all instances of Expr::Unnest with a subquery (which had the relevant LogialPlan::Unnest) 🤔

@jayzhan211
Copy link
Contributor

jayzhan211 commented Jun 24, 2023

I have question to unnest multi-columns.
I am considering applying unnest for each array and join them together, is there an existing plan for this joining?
For example, unnest([1,2], [3,4,5]) is simplified to unnest_plan([1,2]), unnest_plan([3,4,5])

Ok(Some(Union
  Unnest: unnest(make_array(Int64(1),Int64(2)),make_array(Int64(3),Int64(4)))_0
    Projection: List([1,2]) AS unnest(make_array(Int64(1),Int64(2)),make_array(Int64(3),Int64(4)))_0
      EmptyRelation
  Unnest: unnest(make_array(Int64(1),Int64(2)),make_array(Int64(3),Int64(4)))_1
    Projection: List([3,4]) AS unnest(make_array(Int64(1),Int64(2)),make_array(Int64(3),Int64(4)))_1
      EmptyRelation))

Expected result

1 3
2 4
x 5

For those Unnest, apply some kind of Join for them. I did not find any JoinType can concat rows side by side. Are there any existing ways to achieve this?

@alamb
Copy link
Contributor

alamb commented Jun 24, 2023

. I did not find any JoinType can concat rows side by side. Are there any existing ways to achieve this?

Maybe you could use row_number() window function to generate a number for each subquery and then use that row number as the join key?

@jonahgao
Copy link
Member

jonahgao commented Apr 9, 2024

take

@jonahgao
Copy link
Member

jonahgao commented Apr 9, 2024

  • unnest with multiple argument (more than 1) (this form is only allowed in a query's FROM clause)

I plan to implement this feature.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants