-
Notifications
You must be signed in to change notification settings - Fork 608
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bug(impala): generated CTEs in wrong order #7776
Comments
@ergodiclife Thanks for the issue! I am unable to reproduce the problem you see with a modified version of your code: def test_cte_refs_in_topo_order2(snapshot):
import pandas as pd
# Connect to Impala database
schemas = dict(
X1=ibis.schema(dict(id="int64", date="date", name="string", A="int64")),
X2=ibis.schema(dict(id="int64", date="date", name="string", B="int64")),
X3=ibis.schema(dict(id="int64", date="date", name="string", C="int64")),
)
# Create a dataframe containing the unique field names
df_unique_fields = pd.DataFrame(
{
"table_name": ["X1", "X2", "X3"],
"val_id": [1, 2, 3],
"val_code": ["A", "B", "C"],
}
)
# Create a list of common field names
core_fields = [_.id, _.date, _.name]
# Build the extraction query from each table.
exprs = []
for i in df_unique_fields.iterrows():
dat = i[1]
unique_fields = [
ibis.literal(dat["val_id"]).name("value_id"),
_[dat["val_code"]].name("value"),
]
all_fields = core_fields + unique_fields
exprs.append(
ibis.table(schemas[dat["table_name"]], dat["table_name"]).select(all_fields)
)
# Combine all the data from subtables into a single UNION expression.
all_exprs = ibis.union(*exprs)
# Final filter
df_out = all_exprs.filter((_.date >= "2023-01-01") & (_.value.notnull()))
# Generate SQL
out = ibis.to_sql(df_out, dialect="impala")
snapshot.assert_match(out, "out.sql") produces this SQL: WITH t1 AS (
SELECT
t4.`id`,
t4.`date`,
t4.`name`,
2 AS `value_id`,
t4.`B` AS `value`
FROM `X2` AS t4
), t2 AS (
SELECT
t4.`id`,
t4.`date`,
t4.`name`,
1 AS `value_id`,
t4.`A` AS `value`
FROM `X1` AS t4
), t3 AS (
SELECT
t4.`id`,
t4.`date`,
t4.`name`,
3 AS `value_id`,
t4.`C` AS `value`
FROM `X3` AS t4
)
SELECT
t0.`id`,
t0.`date`,
t0.`name`,
t0.`value_id`,
t0.`value`
FROM (
SELECT
*
FROM t2
UNION ALL
SELECT
*
FROM t1
UNION ALL
SELECT
*
FROM t3
) AS t0
WHERE
(
t0.`date` >= '2023-01-01'
) AND (
NOT t0.`value` IS NULL
) which looks correct to my eye. Can you show the SQL output you're observing with your code? Can you try to create a copy and paste example that reproduces the incorrect output and that doesn't require connecting to Impala? |
Thanks Philip. I've created a reproducible example following your steer, and also aligned it more closely with my actual code to replicate the error: import ibis
import string
import pandas as pd
from ibis import _
# Generate schema using up to 'n' letters.
n = 3
letters = string.ascii_uppercase[:n]
schemas = dict()
for i in range(n):
d = dict(id="int64", date="date", name="string")
d[letters[i]] = "int64"
schemas[f"X{i + 1}"] = d
# Create a dataframe containing the unique field names
df_unique_fields = pd.DataFrame(
{
"table_name": list(schemas.keys()),
"val_id": [x + 1 for x in range(n)],
"val_code": [x for x in letters],
}
)
# Create a list of common field names
core_fields = [_.id, _.date, _.name]
# Build the extraction query from each table.
exprs = []
for i in df_unique_fields.iterrows():
dat = i[1]
unique_fields = [
ibis.literal(dat["val_id"]).name("value_id"),
_[dat["val_code"]].name("value"),
]
all_fields = core_fields + unique_fields
exprs.append(
ibis.table(schemas[dat["table_name"]], dat["table_name"]).select(all_fields)
)
# Combine all the data from subtables into a single UNION expression.
all_exprs = ibis.union(*exprs)
# Final output
df_out = (
all_exprs.select(
_.id,
_.date.cast("timestamp").date().name("date"),
_.name.concat("_", _.value_id.cast("string")).name("value_code"),
_.name,
_.value.round(4).name("value"),
)
.filter((_.date >= "2023-01-01") & (_.value.notnull()))
.order_by([_.id, ibis.desc(_.date)])
)
# Generate SQL
out = ibis.to_sql(df_out, dialect="impala") Now WITH t0 AS (
SELECT
t3.`id`,
t3.`date`,
t3.`name`,
t3.`value_id`,
t3.`value`
FROM (
SELECT
*
FROM t5
UNION ALL
SELECT
*
FROM t4
UNION ALL
SELECT
*
FROM t6
) AS t3
), t1 AS (
SELECT
t0.`id`,
TO_DATE(CAST(t0.`date` AS TIMESTAMP)) AS `date`,
CONCAT(t0.`name`, '_', CAST(t0.`value_id` AS STRING)) AS `value_code`,
t0.`name`,
ROUND(t0.`value`, 4) AS `value`
FROM t0
), t4 AS (
SELECT
t7.`id`,
t7.`date`,
t7.`name`,
2 AS `value_id`,
t7.`B` AS `value`
FROM `X2` AS t7
), t5 AS (
SELECT
t7.`id`,
t7.`date`,
t7.`name`,
1 AS `value_id`,
t7.`A` AS `value`
FROM `X1` AS t7
), t6 AS (
SELECT
t7.`id`,
t7.`date`,
t7.`name`,
3 AS `value_id`,
t7.`C` AS `value`
FROM `X3` AS t7
)
SELECT
t2.*
FROM (
SELECT
t1.*
FROM t1
WHERE
(
t1.`date` >= '2023-01-01'
) AND (
NOT t1.`value` IS NULL
)
) AS t2
ORDER BY
t2.`id` ASC,
t2.`date` DESC |
@randompark Thank you! We're undertaking a big internals refactor right now, and I've added a passing test for this case to #7580. |
Hi Philip - just out of curiosity, what might be causing this behaviour? the repetitive and redundant aliasing (as |
I don't have an example right now, but I did notice something similar. Once the expression grows beyond certain depth I was getting errors related to what looked like erroneous aliases. I usually solve this by persisting the intermediate steps rather than building very large expressions. |
@contang0 No need for an example. This issue will be fixed in 8.0 or 9.0 (likely 9.0). We've already added a test case for it in kszucs@d9591f7 |
Since this has already been resolved in |
What happened?
I'm extracting data from a large number of
Impala
tables. I have a set of fields that are common to all tables, and another set of fields which are unique to each table.Here's a simplified version of the code I'm using to generate the SQL code:
This generates SQL code where the order of the CTEs are jumbled - both the big
UNION
CTE (all_exprs
) and the final filtering CTE (df_out
) appear before the smaller CTEs extracting data from each underlying table.Would you be able to help?
What version of ibis are you using?
Python
3.10.12 on Windows 10 usingibis
7.1.0.What backend(s) are you using, if any?
Impala
Relevant log output
No response
Code of Conduct
The text was updated successfully, but these errors were encountered: