bug(impala): generated CTEs in wrong order #7776

chris-park · 2023-12-15T17:13:51Z

What happened?

I'm extracting data from a large number of Impala tables. I have a set of fields that are common to all tables, and another set of fields which are unique to each table.

Here's a simplified version of the code I'm using to generate the SQL code:

import os
import ibis
import pandas as pd
from ibis import _

# Connect to Impala database
con = ibis.impala.connect(
    user=os.environ['USER'], 
    password=os.environ['PWD']
)

# Create a dataframe containing the unique field names
df_unique_fields = pd.DataFrame({
    'table_name': ['X1', 'X2', 'X3'],
    'val_id': [1, 2, 3],
    'val_code': ['A', 'B', 'C']
})

# Create a list of common field names
common_fields = [
    _.id,
    _.date,
    _.name
]

# Build the extraction query from each table.
exprs = []
for i in df_unique_fields.iterrows():
    dat= i[1]
    unique_fields = [
        ibis.literal(dat['val_id']).name("value_id"),
        _[dat['val_code']].name("value")
    ]
    all_fields = core_fields + unique_fields
    exprs.append(
        con.table(dat['table_name'], database='database_name').select(all_fields)
    )

# Combine all the data from subtables into a single UNION expression.
all_exprs = ibis.union(*exprs)

# Final filter
df_out = all_exprs.filter((_.date >= '2023-01-01') & (_.value.notnull()))

# Generate SQL
ibis.to_sql(df_out)

This generates SQL code where the order of the CTEs are jumbled - both the big UNION CTE (all_exprs) and the final filtering CTE (df_out) appear before the smaller CTEs extracting data from each underlying table.

Would you be able to help?

What version of ibis are you using?

Python 3.10.12 on Windows 10 using ibis 7.1.0.

What backend(s) are you using, if any?

Impala

Relevant log output

No response

Code of Conduct

I agree to follow this project's Code of Conduct

The text was updated successfully, but these errors were encountered:

cpcloud · 2023-12-16T11:22:37Z

@ergodiclife Thanks for the issue!

I am unable to reproduce the problem you see with a modified version of your code:

def test_cte_refs_in_topo_order2(snapshot):
    import pandas as pd

    # Connect to Impala database
    schemas = dict(
        X1=ibis.schema(dict(id="int64", date="date", name="string", A="int64")),
        X2=ibis.schema(dict(id="int64", date="date", name="string", B="int64")),
        X3=ibis.schema(dict(id="int64", date="date", name="string", C="int64")),
    )

    # Create a dataframe containing the unique field names
    df_unique_fields = pd.DataFrame(
        {
            "table_name": ["X1", "X2", "X3"],
            "val_id": [1, 2, 3],
            "val_code": ["A", "B", "C"],
        }
    )

    # Create a list of common field names
    core_fields = [_.id, _.date, _.name]

    # Build the extraction query from each table.
    exprs = []
    for i in df_unique_fields.iterrows():
        dat = i[1]
        unique_fields = [
            ibis.literal(dat["val_id"]).name("value_id"),
            _[dat["val_code"]].name("value"),
        ]
        all_fields = core_fields + unique_fields
        exprs.append(
            ibis.table(schemas[dat["table_name"]], dat["table_name"]).select(all_fields)
        )

    # Combine all the data from subtables into a single UNION expression.
    all_exprs = ibis.union(*exprs)

    # Final filter
    df_out = all_exprs.filter((_.date >= "2023-01-01") & (_.value.notnull()))

    # Generate SQL
    out = ibis.to_sql(df_out, dialect="impala")
    snapshot.assert_match(out, "out.sql")

produces this SQL:

WITH t1 AS (
  SELECT
    t4.`id`,
    t4.`date`,
    t4.`name`,
    2 AS `value_id`,
    t4.`B` AS `value`
  FROM `X2` AS t4
), t2 AS (
  SELECT
    t4.`id`,
    t4.`date`,
    t4.`name`,
    1 AS `value_id`,
    t4.`A` AS `value`
  FROM `X1` AS t4
), t3 AS (
  SELECT
    t4.`id`,
    t4.`date`,
    t4.`name`,
    3 AS `value_id`,
    t4.`C` AS `value`
  FROM `X3` AS t4
)
SELECT
  t0.`id`,
  t0.`date`,
  t0.`name`,
  t0.`value_id`,
  t0.`value`
FROM (
  SELECT
    *
  FROM t2
  UNION ALL
  SELECT
    *
  FROM t1
  UNION ALL
  SELECT
    *
  FROM t3
) AS t0
WHERE
  (
    t0.`date` >= '2023-01-01'
  ) AND (
    NOT t0.`value` IS NULL
  )

which looks correct to my eye.

Can you show the SQL output you're observing with your code? Can you try to create a copy and paste example that reproduces the incorrect output and that doesn't require connecting to Impala?

chris-park · 2023-12-17T08:39:10Z

Thanks Philip.

I've created a reproducible example following your steer, and also aligned it more closely with my actual code to replicate the error:

import ibis
import string
import pandas as pd

from ibis import _


# Generate schema using up to 'n' letters. 
n = 3
letters = string.ascii_uppercase[:n]
schemas = dict()

for i in range(n):
    d = dict(id="int64", date="date", name="string")
    d[letters[i]] = "int64"
    schemas[f"X{i + 1}"] = d 

# Create a dataframe containing the unique field names
df_unique_fields = pd.DataFrame(
    {
        "table_name": list(schemas.keys()),
        "val_id": [x + 1 for x in range(n)],
        "val_code": [x for x in letters],
    }
)

# Create a list of common field names
core_fields = [_.id, _.date, _.name]

# Build the extraction query from each table.
exprs = []

for i in df_unique_fields.iterrows():
    dat = i[1]
    unique_fields = [
        ibis.literal(dat["val_id"]).name("value_id"),
        _[dat["val_code"]].name("value"),
    ]
    all_fields = core_fields + unique_fields
    exprs.append(
        ibis.table(schemas[dat["table_name"]], dat["table_name"]).select(all_fields)
    )

# Combine all the data from subtables into a single UNION expression.
all_exprs = ibis.union(*exprs)

# Final output
df_out = (
    all_exprs.select(
        _.id,
        _.date.cast("timestamp").date().name("date"),
        _.name.concat("_", _.value_id.cast("string")).name("value_code"),
        _.name,
        _.value.round(4).name("value"),
    )
    .filter((_.date >= "2023-01-01") & (_.value.notnull()))
    .order_by([_.id, ibis.desc(_.date)])
)

# Generate SQL
out = ibis.to_sql(df_out, dialect="impala")

Now out produces the following incorrect SQL code, where the CTEs t0 and t1 are defined before t4 - t6:

WITH t0 AS (
  SELECT
    t3.`id`,
    t3.`date`,
    t3.`name`,
    t3.`value_id`,
    t3.`value`
  FROM (
    SELECT
      *
    FROM t5
    UNION ALL
    SELECT
      *
    FROM t4
    UNION ALL
    SELECT
      *
    FROM t6
  ) AS t3
), t1 AS (
  SELECT
    t0.`id`,
    TO_DATE(CAST(t0.`date` AS TIMESTAMP)) AS `date`,
    CONCAT(t0.`name`, '_', CAST(t0.`value_id` AS STRING)) AS `value_code`,
    t0.`name`,
    ROUND(t0.`value`, 4) AS `value`
  FROM t0
), t4 AS (
  SELECT
    t7.`id`,
    t7.`date`,
    t7.`name`,
    2 AS `value_id`,
    t7.`B` AS `value`
  FROM `X2` AS t7
), t5 AS (
  SELECT
    t7.`id`,
    t7.`date`,
    t7.`name`,
    1 AS `value_id`,
    t7.`A` AS `value`
  FROM `X1` AS t7
), t6 AS (
  SELECT
    t7.`id`,
    t7.`date`,
    t7.`name`,
    3 AS `value_id`,
    t7.`C` AS `value`
  FROM `X3` AS t7
)
SELECT
  t2.*
FROM (
  SELECT
    t1.*
  FROM t1
  WHERE
    (
      t1.`date` >= '2023-01-01'
    ) AND (
      NOT t1.`value` IS NULL
    )
) AS t2
ORDER BY
  t2.`id` ASC,
  t2.`date` DESC

cpcloud · 2023-12-18T14:44:41Z

@randompark Thank you!

We're undertaking a big internals refactor right now, and I've added a passing test for this case to #7580.

chris-park · 2023-12-19T16:45:19Z

Hi Philip - just out of curiosity, what might be causing this behaviour? the repetitive and redundant aliasing (as t7) of the table in each component CTE is also odd.

contang0 · 2024-01-04T06:52:27Z

I don't have an example right now, but I did notice something similar. Once the expression grows beyond certain depth I was getting errors related to what looked like erroneous aliases. I usually solve this by persisting the intermediate steps rather than building very large expressions.

cpcloud · 2024-01-04T10:44:47Z

@contang0 No need for an example. This issue will be fixed in 8.0 or 9.0 (likely 9.0). We've already added a test case for it in kszucs@d9591f7

jcrist · 2024-01-26T19:19:23Z

Since this has already been resolved in the-epic-split branch, I'm going to close this.

chris-park added the bug Incorrect behavior inside of ibis label Dec 15, 2023

github-project-automation bot added this to Ibis planning and roadmap Dec 15, 2023

github-project-automation bot moved this to backlog in Ibis planning and roadmap Dec 15, 2023

tswast changed the title ~~bug:~~ bug(impala): generated CTEs in wrong order Dec 15, 2023

tswast added the impala The Apache Impala backend label Dec 15, 2023

cpcloud added the not reproducible A bug report or issue than cannot be reproduced as reported label Dec 16, 2023

cpcloud added a commit to kszucs/ibis that referenced this issue Dec 18, 2023

test(cte-order): add test case for ibis-project#7776

d9591f7

cpcloud added chained joins The bane of Ibis's existence and removed not reproducible A bug report or issue than cannot be reproduced as reported labels Dec 18, 2023

cpcloud linked a pull request Dec 18, 2023 that will close this issue

refactor(ir): rethink the relational operations #7580

Closed

jcrist closed this as completed Jan 26, 2024

github-project-automation bot moved this from backlog to done in Ibis planning and roadmap Jan 26, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bug(impala): generated CTEs in wrong order #7776

bug(impala): generated CTEs in wrong order #7776

chris-park commented Dec 15, 2023 •

edited by cpcloud

Loading

cpcloud commented Dec 16, 2023 •

edited

Loading

chris-park commented Dec 17, 2023 •

edited

Loading

cpcloud commented Dec 18, 2023 •

edited

Loading

chris-park commented Dec 19, 2023

contang0 commented Jan 4, 2024

cpcloud commented Jan 4, 2024

jcrist commented Jan 26, 2024

bug(impala): generated CTEs in wrong order #7776

bug(impala): generated CTEs in wrong order #7776

Comments

chris-park commented Dec 15, 2023 • edited by cpcloud Loading

What happened?

What version of ibis are you using?

What backend(s) are you using, if any?

Relevant log output

Code of Conduct

cpcloud commented Dec 16, 2023 • edited Loading

chris-park commented Dec 17, 2023 • edited Loading

cpcloud commented Dec 18, 2023 • edited Loading

chris-park commented Dec 19, 2023

contang0 commented Jan 4, 2024

cpcloud commented Jan 4, 2024

jcrist commented Jan 26, 2024

chris-park commented Dec 15, 2023 •

edited by cpcloud

Loading

cpcloud commented Dec 16, 2023 •

edited

Loading

chris-park commented Dec 17, 2023 •

edited

Loading

cpcloud commented Dec 18, 2023 •

edited

Loading