Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug(impala): generated CTEs in wrong order #7776

Closed
1 task done
chris-park opened this issue Dec 15, 2023 · 7 comments
Closed
1 task done

bug(impala): generated CTEs in wrong order #7776

chris-park opened this issue Dec 15, 2023 · 7 comments
Labels
bug Incorrect behavior inside of ibis chained joins The bane of Ibis's existence impala The Apache Impala backend

Comments

@chris-park
Copy link
Contributor

chris-park commented Dec 15, 2023

What happened?

I'm extracting data from a large number of Impala tables. I have a set of fields that are common to all tables, and another set of fields which are unique to each table.

Here's a simplified version of the code I'm using to generate the SQL code:

import os
import ibis
import pandas as pd
from ibis import _

# Connect to Impala database
con = ibis.impala.connect(
    user=os.environ['USER'], 
    password=os.environ['PWD']
)

# Create a dataframe containing the unique field names
df_unique_fields = pd.DataFrame({
    'table_name': ['X1', 'X2', 'X3'],
    'val_id': [1, 2, 3],
    'val_code': ['A', 'B', 'C']
})

# Create a list of common field names
common_fields = [
    _.id,
    _.date,
    _.name
]

# Build the extraction query from each table.
exprs = []
for i in df_unique_fields.iterrows():
    dat= i[1]
    unique_fields = [
        ibis.literal(dat['val_id']).name("value_id"),
        _[dat['val_code']].name("value")
    ]
    all_fields = core_fields + unique_fields
    exprs.append(
        con.table(dat['table_name'], database='database_name').select(all_fields)
    )

# Combine all the data from subtables into a single UNION expression.
all_exprs = ibis.union(*exprs)

# Final filter
df_out = all_exprs.filter((_.date >= '2023-01-01') & (_.value.notnull()))

# Generate SQL
ibis.to_sql(df_out)

This generates SQL code where the order of the CTEs are jumbled - both the big UNION CTE (all_exprs) and the final filtering CTE (df_out) appear before the smaller CTEs extracting data from each underlying table.

Would you be able to help?

What version of ibis are you using?

Python 3.10.12 on Windows 10 using ibis 7.1.0.

What backend(s) are you using, if any?

Impala

Relevant log output

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@chris-park chris-park added the bug Incorrect behavior inside of ibis label Dec 15, 2023
@tswast tswast changed the title bug: bug(impala): generated CTEs in wrong order Dec 15, 2023
@tswast tswast added the impala The Apache Impala backend label Dec 15, 2023
@cpcloud
Copy link
Member

cpcloud commented Dec 16, 2023

@ergodiclife Thanks for the issue!

I am unable to reproduce the problem you see with a modified version of your code:

def test_cte_refs_in_topo_order2(snapshot):
    import pandas as pd

    # Connect to Impala database
    schemas = dict(
        X1=ibis.schema(dict(id="int64", date="date", name="string", A="int64")),
        X2=ibis.schema(dict(id="int64", date="date", name="string", B="int64")),
        X3=ibis.schema(dict(id="int64", date="date", name="string", C="int64")),
    )

    # Create a dataframe containing the unique field names
    df_unique_fields = pd.DataFrame(
        {
            "table_name": ["X1", "X2", "X3"],
            "val_id": [1, 2, 3],
            "val_code": ["A", "B", "C"],
        }
    )

    # Create a list of common field names
    core_fields = [_.id, _.date, _.name]

    # Build the extraction query from each table.
    exprs = []
    for i in df_unique_fields.iterrows():
        dat = i[1]
        unique_fields = [
            ibis.literal(dat["val_id"]).name("value_id"),
            _[dat["val_code"]].name("value"),
        ]
        all_fields = core_fields + unique_fields
        exprs.append(
            ibis.table(schemas[dat["table_name"]], dat["table_name"]).select(all_fields)
        )

    # Combine all the data from subtables into a single UNION expression.
    all_exprs = ibis.union(*exprs)

    # Final filter
    df_out = all_exprs.filter((_.date >= "2023-01-01") & (_.value.notnull()))

    # Generate SQL
    out = ibis.to_sql(df_out, dialect="impala")
    snapshot.assert_match(out, "out.sql")

produces this SQL:

WITH t1 AS (
  SELECT
    t4.`id`,
    t4.`date`,
    t4.`name`,
    2 AS `value_id`,
    t4.`B` AS `value`
  FROM `X2` AS t4
), t2 AS (
  SELECT
    t4.`id`,
    t4.`date`,
    t4.`name`,
    1 AS `value_id`,
    t4.`A` AS `value`
  FROM `X1` AS t4
), t3 AS (
  SELECT
    t4.`id`,
    t4.`date`,
    t4.`name`,
    3 AS `value_id`,
    t4.`C` AS `value`
  FROM `X3` AS t4
)
SELECT
  t0.`id`,
  t0.`date`,
  t0.`name`,
  t0.`value_id`,
  t0.`value`
FROM (
  SELECT
    *
  FROM t2
  UNION ALL
  SELECT
    *
  FROM t1
  UNION ALL
  SELECT
    *
  FROM t3
) AS t0
WHERE
  (
    t0.`date` >= '2023-01-01'
  ) AND (
    NOT t0.`value` IS NULL
  )

which looks correct to my eye.

Can you show the SQL output you're observing with your code? Can you try to create a copy and paste example that reproduces the incorrect output and that doesn't require connecting to Impala?

@cpcloud cpcloud added the not reproducible A bug report or issue than cannot be reproduced as reported label Dec 16, 2023
@chris-park
Copy link
Contributor Author

chris-park commented Dec 17, 2023

Thanks Philip.

I've created a reproducible example following your steer, and also aligned it more closely with my actual code to replicate the error:

import ibis
import string
import pandas as pd

from ibis import _


# Generate schema using up to 'n' letters. 
n = 3
letters = string.ascii_uppercase[:n]
schemas = dict()

for i in range(n):
    d = dict(id="int64", date="date", name="string")
    d[letters[i]] = "int64"
    schemas[f"X{i + 1}"] = d 

# Create a dataframe containing the unique field names
df_unique_fields = pd.DataFrame(
    {
        "table_name": list(schemas.keys()),
        "val_id": [x + 1 for x in range(n)],
        "val_code": [x for x in letters],
    }
)

# Create a list of common field names
core_fields = [_.id, _.date, _.name]

# Build the extraction query from each table.
exprs = []

for i in df_unique_fields.iterrows():
    dat = i[1]
    unique_fields = [
        ibis.literal(dat["val_id"]).name("value_id"),
        _[dat["val_code"]].name("value"),
    ]
    all_fields = core_fields + unique_fields
    exprs.append(
        ibis.table(schemas[dat["table_name"]], dat["table_name"]).select(all_fields)
    )

# Combine all the data from subtables into a single UNION expression.
all_exprs = ibis.union(*exprs)

# Final output
df_out = (
    all_exprs.select(
        _.id,
        _.date.cast("timestamp").date().name("date"),
        _.name.concat("_", _.value_id.cast("string")).name("value_code"),
        _.name,
        _.value.round(4).name("value"),
    )
    .filter((_.date >= "2023-01-01") & (_.value.notnull()))
    .order_by([_.id, ibis.desc(_.date)])
)

# Generate SQL
out = ibis.to_sql(df_out, dialect="impala")

Now out produces the following incorrect SQL code, where the CTEs t0 and t1 are defined before t4 - t6:

WITH t0 AS (
  SELECT
    t3.`id`,
    t3.`date`,
    t3.`name`,
    t3.`value_id`,
    t3.`value`
  FROM (
    SELECT
      *
    FROM t5
    UNION ALL
    SELECT
      *
    FROM t4
    UNION ALL
    SELECT
      *
    FROM t6
  ) AS t3
), t1 AS (
  SELECT
    t0.`id`,
    TO_DATE(CAST(t0.`date` AS TIMESTAMP)) AS `date`,
    CONCAT(t0.`name`, '_', CAST(t0.`value_id` AS STRING)) AS `value_code`,
    t0.`name`,
    ROUND(t0.`value`, 4) AS `value`
  FROM t0
), t4 AS (
  SELECT
    t7.`id`,
    t7.`date`,
    t7.`name`,
    2 AS `value_id`,
    t7.`B` AS `value`
  FROM `X2` AS t7
), t5 AS (
  SELECT
    t7.`id`,
    t7.`date`,
    t7.`name`,
    1 AS `value_id`,
    t7.`A` AS `value`
  FROM `X1` AS t7
), t6 AS (
  SELECT
    t7.`id`,
    t7.`date`,
    t7.`name`,
    3 AS `value_id`,
    t7.`C` AS `value`
  FROM `X3` AS t7
)
SELECT
  t2.*
FROM (
  SELECT
    t1.*
  FROM t1
  WHERE
    (
      t1.`date` >= '2023-01-01'
    ) AND (
      NOT t1.`value` IS NULL
    )
) AS t2
ORDER BY
  t2.`id` ASC,
  t2.`date` DESC

cpcloud added a commit to kszucs/ibis that referenced this issue Dec 18, 2023
@cpcloud cpcloud added chained joins The bane of Ibis's existence and removed not reproducible A bug report or issue than cannot be reproduced as reported labels Dec 18, 2023
@cpcloud cpcloud linked a pull request Dec 18, 2023 that will close this issue
@cpcloud
Copy link
Member

cpcloud commented Dec 18, 2023

@randompark Thank you!

We're undertaking a big internals refactor right now, and I've added a passing test for this case to #7580.

@chris-park
Copy link
Contributor Author

Hi Philip - just out of curiosity, what might be causing this behaviour? the repetitive and redundant aliasing (as t7) of the table in each component CTE is also odd.

@contang0
Copy link

contang0 commented Jan 4, 2024

I don't have an example right now, but I did notice something similar. Once the expression grows beyond certain depth I was getting errors related to what looked like erroneous aliases. I usually solve this by persisting the intermediate steps rather than building very large expressions.

@cpcloud
Copy link
Member

cpcloud commented Jan 4, 2024

@contang0 No need for an example. This issue will be fixed in 8.0 or 9.0 (likely 9.0). We've already added a test case for it in kszucs@d9591f7

@jcrist
Copy link
Member

jcrist commented Jan 26, 2024

Since this has already been resolved in the-epic-split branch, I'm going to close this.

@jcrist jcrist closed this as completed Jan 26, 2024
@github-project-automation github-project-automation bot moved this from backlog to done in Ibis planning and roadmap Jan 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Incorrect behavior inside of ibis chained joins The bane of Ibis's existence impala The Apache Impala backend
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

5 participants