Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support outer Unnest/Explode in BigQuery dialect #2941

Closed
chelsea-lin opened this issue Feb 9, 2024 · 11 comments
Closed

Support outer Unnest/Explode in BigQuery dialect #2941

chelsea-lin opened this issue Feb 9, 2024 · 11 comments

Comments

@chelsea-lin
Copy link
Contributor

chelsea-lin commented Feb 9, 2024

Is your feature request related to a problem? Please describe.

  • Cannot preserve empty arrays: SQLGlot's current explode_to_unnest transformation for BigQuery cannot include null/empty arrays or EXPLODE_OUTER semantics as expected, leading to potential data loss.

  • Suboptimal Output: Even in successful EXPLODE translation, the generated BigQuery SQL is verbose and could be simplified.

Here is the sample API and outputs:

In [16]: sqlglot.parse_one("SELECT EXPLODE(arr) FROM cte").sql(dialect='bigquery')
Out[16]: 'SELECT IF(pos = pos_2, col, NULL) AS col FROM cte CROSS JOIN UNNEST(GENERATE_ARRAY(0, GREATEST(ARRAY_LENGTH(arr)) - 1)) AS pos CROSS JOIN UNNEST(arr) AS col WITH OFFSET AS pos_2 WHERE pos = pos_2 OR (pos > (ARRAY_LENGTH(arr) - 1) AND pos_2 = (ARRAY_LENGTH(arr) - 1))'


In [18]: sqlglot.parse_one("SELECT EXPLODE_OUTER(arr) FROM cte").sql(dialect='bigquery')
Out[18]: 'SELECT IF(pos = pos_2, col, NULL) AS col FROM cte CROSS JOIN UNNEST(GENERATE_ARRAY(0, GREATEST(ARRAY_LENGTH(arr)) - 1)) AS pos CROSS JOIN UNNEST(arr) AS col WITH OFFSET AS pos_2 WHERE pos = pos_2 OR (pos > (ARRAY_LENGTH(arr) - 1) AND pos_2 = (ARRAY_LENGTH(arr) - 1))'

IIUC, for BigQuery, the above SQL equals to:

SELECT col FROM cte CROSS JOIN UNNEST(arr) AS col

Flatting arrays with a CROSS JOIN excludes rows that have empty or NULL array. BigQuery supports LEFT JOIN to include these rows. However, sqlglot does not have equal behavior here:

Describe the solution you'd like
Create a custom explode_to_unnest function dedicated to BigQuery handling. This allow for:

  • EXPLODE_OUTER support: Introduce logic to accurately translate EXPLODE_OUTER into BigQuery SQL using LEFT JOIN.
  • Streamlined expression tree: Utilize BigQuery's native UNNEST operation for simpler EXPLODE translation. That can improves the efficiency and readability of generated BigQuery SQL.
In [2]: import sqlglot

In [3]: sqlglot.parse_one("SELECT EXPLODE(arr) FROM cte").sql(dialect='bigquery')
Out[3]: 'SELECT col FROM cte CROSS JOIN UNNEST(arr) AS col'

In [4]: sqlglot.parse_one("SELECT EXPLODE_OUTER(arr) FROM cte").sql(dialect='bigquery')
Out[4]: 'SELECT col FROM cte LEFT JOIN UNNEST(arr) AS col'

In [5]: sqlglot.parse_one("SELECT POSEXPLODE_OUTER(arr) FROM cte").sql(dialect='bigquery')
Out[5]: 'SELECT col, pos FROM cte LEFT JOIN UNNEST(arr) AS col WITH OFFSET AS pos'

Describe alternatives you've considered
Modifying the core explode_to_unnest might introduce incompatibilities with other database dialects that are currently handled correctly.

@tobymao
Copy link
Owner

tobymao commented Feb 9, 2024

can you give an actual example of bigwuery input and output and why it’s wrong?

we don’t do left join because presto doesn’t support left join unnest

@tobymao
Copy link
Owner

tobymao commented Feb 9, 2024

i tested this and it seems like the only case that isn't working as expected is explode_outer(array())

@tobymao tobymao closed this as completed in 844018b Feb 9, 2024
@chelsea-lin
Copy link
Contributor Author

chelsea-lin commented Feb 9, 2024

Thanks for the quick fixing.

@chelsea-lin
Copy link
Contributor Author

I am also trying to implement it by creating a custom explode_to_unnest in BigQuery. In this way, the translated query would be simpler (examples are shown in the issue descriptions). Do you have any thoughts about that?

@tobymao
Copy link
Owner

tobymao commented Feb 9, 2024

yea, it's easy to do this as a left join unnest, but i chose not to do this because presto can't handle it and sqlglot handles many dialects not just one.

the complexity of the sql is not a goal for sqlglot, simply accuracy.

@chelsea-lin
Copy link
Contributor Author

chelsea-lin commented Feb 9, 2024

The generated SQL still have one more issues, shown as below:
image

The 2nd query is trying to flatting a float array z, but the compiler complains the type mismatches because [NULL] is a integer array. To be noticed, the unnsted array can be any types, such as ARRAY<STRUCT<ARRAY<STRUCT....

@tobymao
Copy link
Owner

tobymao commented Feb 9, 2024

do you have a solution for this?

@chelsea-lin
Copy link
Contributor Author

chelsea-lin commented Feb 9, 2024

I didn't find out a way to correct the SQL with CROSS JOIN. But I sent a draft PR to define a custom function for BQ. The draft PR fails on some unit test currently. If that works for you, I will fix the unit test tomorrow for your official review.
#2944

@tobymao
Copy link
Owner

tobymao commented Feb 9, 2024

SELECT
  IF(pos = pos_2, col, NULL) AS col
FROM x
CROSS JOIN UNNEST(GENERATE_ARRAY(
  0,
  GREATEST(ARRAY_LENGTH(IF(ARRAY_LENGTH(COALESCE(a, [])) = 0, [a[SAFE_ORDINAL(0)]], a))) - 1
)) AS pos
CROSS JOIN UNNEST(IF(ARRAY_LENGTH(COALESCE(a, [])) = 0, [a[SAFE_ORDINAL(0)]], a)) AS col WITH OFFSET AS pos_2
WHERE
  pos = pos_2
  OR (
    pos > (
      ARRAY_LENGTH(IF(ARRAY_LENGTH(COALESCE(a, [])) = 0, [a[SAFE_ORDINAL(0)]], a)) - 1
    )
    AND pos_2 = (
      ARRAY_LENGTH(IF(ARRAY_LENGTH(COALESCE(a, [])) = 0, [a[SAFE_ORDINAL(0)]], a)) - 1
    )
  )

@tobymao
Copy link
Owner

tobymao commented Feb 9, 2024

b70a394

@chelsea-lin
Copy link
Contributor Author

Great to see that works! Thanks for quick fixing!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants