bug: maximum recursion depth with join operation #7124

Closed
1 task done
nehanene15 opened this issue Sep 11, 2023 · 23 comments · Fixed by #7148
Labels
bug Incorrect behavior inside of ibis

Comments

@nehanene15

What happened?

We're seeing a RecursionError: maximum recursion depth exceeded while calling a Python object when running a JOIN: source_difference = source.join(differences, join_keys, how="outer")
Both 'source' and 'differences' are pandas.Table()s with many columns (~120).

We don't hit this error with smaller, narrower tables. I've provided an abridged version of the stack trace below. It does look like there is a cyclical portion of the code when testing whether the left and right tables have a common parent expr (the left.equals(right) check).

Trying to understand if this is a Python limitation due to how wide the table is, or an Ibis bug. Appreciate the help!

What version of ibis are you using?

5.1.0

What backend(s) are you using, if any?

Pandas

Relevant log output

Traceback (most recent call last):
  File "/x/home/user/new_fl/dvt4/lib/python3.8/site-packages/ibis/common/grounds.py", line 210, in __cached_equals__
    result = self.__cache__[key]
  File "/x/home/user/new_fl/dvt4/lib/python3.8/site-packages/ibis/common/caching.py", line 46, in __getitem__
    value, _ = self._data[identifiers]
KeyError: (139856786236240, 139856788868000)

During handling of the above exception, another exception occurred:
...
During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/x/home/user/new_fl/dvt4/lib/python3.8/site-packages/ibis/common/grounds.py", line 210, in __cached_equals__
    result = self.__cache__[key]
  File "/x/home/user/new_fl/dvt4/lib/python3.8/site-packages/ibis/common/caching.py", line 46, in __getitem__
    value, _ = self._data[identifiers]
KeyError: (139856787188608, 139856789577136)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/x/home/user/new_fl/dvt4/bin/data-validation", line 11, in <module>
    load_entry_point('google-pso-data-validator==4.1.0', 'console_scripts', 'data-validation')()
  File "/x/home/user/new_fl/dvt4/lib/python3.8/site-packages/data_validation/__main__.py", line 581, in main
    run_validation_configs(args)
  File "/x/home/user/new_fl/dvt4/lib/python3.8/site-packages/data_validation/__main__.py", line 551, in run_validation_configs
    config_runner(args)
  File "/x/home/user/new_fl/dvt4/lib/python3.8/site-packages/data_validation/__main__.py", line 304, in config_runner
    run_validations(args, config_managers)
  File "/x/home/user/new_fl/dvt4/lib/python3.8/site-packages/data_validation/__main__.py", line 478, in run_validations
    run_validation(
  File "/x/home/user/new_fl/dvt4/lib/python3.8/site-packages/data_validation/__main__.py", line 461, in run_validation
    validator.execute()
  File "/x/home/user/new_fl/dvt4/lib/python3.8/site-packages/data_validation/data_validation.py", line 96, in execute
    result_df = self._execute_validation(
  File "/x/home/user/new_fl/dvt4/lib/python3.8/site-packages/data_validation/data_validation.py", line 314, in _execute_validation
    result_df = combiner.generate_report(
  File "/x/home/user/new_fl/dvt4/lib/python3.8/site-packages/data_validation/combiner.py", line 83, in generate_report
    joined = _join_pivots(source_pivot, target_pivot, differences_pivot, join_on_fields)
  File "/x/home/user/new_fl/dvt4/lib/python3.8/site-packages/data_validation/combiner.py", line 317, in _join_pivots
    source_difference = source.join(differences, join_keys, how="outer")[
  File "/x/home/user/new_fl/dvt4/lib/python3.8/site-packages/ibis/expr/types/relations.py", line 2497, in join
    expr = klass(left, right, predicates).to_expr()
  File "/x/home/user/new_fl/dvt4/lib/python3.8/site-packages/ibis/common/grounds.py", line 25, in __call__
    return cls.__create__(*args, **kwargs)
  File "/x/home/user/new_fl/dvt4/lib/python3.8/site-packages/ibis/common/grounds.py", line 99, in __create__
    return super().__create__(**kwargs)
  File "/x/home/user/new_fl/dvt4/lib/python3.8/site-packages/ibis/common/grounds.py", line 33, in __create__
    return type.__call__(cls, *args, **kwargs)
  File "/x/home/user/new_fl/dvt4/lib/python3.8/site-packages/ibis/expr/operations/relations.py", line 178, in __init__
    if left.equals(right):
  File "/x/home/user/new_fl/dvt4/lib/python3.8/site-packages/ibis/expr/operations/core.py", line 24, in equals
    return self.__cached_equals__(other)
  File "/x/home/user/new_fl/dvt4/lib/python3.8/site-packages/ibis/common/grounds.py", line 212, in __cached_equals__
    result = self.__equals__(other)
  File "/x/home/user/new_fl/dvt4/lib/python3.8/site-packages/ibis/common/grounds.py", line 239, in __equals__
    return self.__args__ == other.__args__
  File "/x/home/user/new_fl/dvt4/lib/python3.8/site-packages/ibis/common/grounds.py", line 187, in __eq__
    return self.__cached_equals__(other)
  File "/x/home/user/new_fl/dvt4/lib/python3.8/site-packages/ibis/common/grounds.py", line 212, in __cached_equals__
    result = self.__equals__(other)
  File "/x/home/user/new_fl/dvt4/lib/python3.8/site-packages/ibis/common/grounds.py", line 239, in __equals__
    return self.__args__ == other.__args__
  File "/x/home/user/new_fl/dvt4/lib/python3.8/site-packages/ibis/common/grounds.py", line 187, in __eq__
    return self.__cached_equals__(other)
  File "/x/home/user/new_fl/dvt4/lib/python3.8/site-packages/ibis/common/grounds.py", line 212, in __cached_equals__
    result = self.__equals__(other)
  File "/x/home/user/new_fl/dvt4/lib/python3.8/site-packages/ibis/common/grounds.py", line 239, in __equals__
    return self.__args__ == other.__args__
  File "/x/home/user/new_fl/dvt4/lib/python3.8/site-packages/ibis/common/grounds.py", line 187, in __eq__
    return self.__cached_equals__(other)
  File "/x/home/user/new_fl/dvt4/lib/python3.8/site-packages/ibis/common/grounds.py", line 210, in __cached_equals__
    result = self.__cache__[key]
  File "/x/home/user/new_fl/dvt4/lib/python3.8/site-packages/ibis/common/caching.py", line 45, in __getitem__
    identifiers = tuple(id(item) for item in key)
RecursionError: maximum recursion depth exceeded while calling a Python object

@nehanene15 nehanene15 added the bug Incorrect behavior inside of ibis label Sep 11, 2023
@cpcloud
Member

cpcloud commented Sep 11, 2023

Thanks for the report!

Can you show the code that produces the exception?

It'll be easier to write a regression test if we can get the exact case that raises the exception.

@nehanene15
Author

This is the line that generates the exception: https://github.com/GoogleCloudPlatform/professional-services-data-validator/blob/47154b4139bf22358359c53fe25bbce44745589f/data_validation/combiner.py#L321

And this is where Pandas tables are instantiated before it gets to the combiner.py: https://github.com/GoogleCloudPlatform/professional-services-data-validator/blob/47154b4139bf22358359c53fe25bbce44745589f/data_validation/data_validation.py#L339

For context, the source and target are SQL query results and the combiner.generate_report() aims to find the differences between the two results, if any, for data validation.

@cpcloud
Member

cpcloud commented Sep 11, 2023

Any chance you could dump a parquet file of the left and right tables somewhere?

@nehanene15
Author

GitHub won't allow me to upload Parquet, but I'll upload the CSVs of the source and differences. We're running source_difference = source.join(differences, join_keys, how="outer")

source_pivot.csv
differences_pivot.csv

@kszucs
Member

kszucs commented Sep 11, 2023

Hi @nehanene15!

Could you please try it out with the following patch applied (I assume you are using ibis 6.2):

diff --git a/ibis/common/grounds.py b/ibis/common/grounds.py
index 394bc4ccf..6f5bcd184 100644
--- a/ibis/common/grounds.py
+++ b/ibis/common/grounds.py
@@ -203,6 +203,8 @@ class Comparable(Base):
         if type(self) is not type(other):
             return False

+        return self.__equals__(other)
+
         # reduce space required for commutative operation
         if id(self) < id(other):
             key = (self, other)

This way we turn off an optimization that maintains a global cache for operation node equality checks. If it keeps failing, we should at least get a clearer traceback.
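For readers following along, here is a minimal sketch of the kind of optimization being switched off: an equality check that memoizes results in a global cache keyed by the identities of the two operands. The Node class and its _cache dict are hypothetical stand-ins written for illustration, not the ibis implementation.

```python
class Node:
    """Toy stand-in for an operation node (hypothetical, not the ibis class)."""

    # Hypothetical global cache: (id(a), id(b)) -> bool.
    # Keying on id() is only safe while both objects stay alive.
    _cache = {}

    def __init__(self, *args):
        self.args = args

    def _structural_equals(self, other):
        # Tuple comparison calls __eq__ on each child node, so this recurses
        # one level per nesting level of the expression.
        return self.args == other.args

    def __eq__(self, other):
        if type(self) is not type(other):
            return False
        if self is other:
            return True
        # Order the pair so (a, b) and (b, a) share a single cache entry.
        a, b = (self, other) if id(self) < id(other) else (other, self)
        key = (id(a), id(b))
        try:
            return Node._cache[key]
        except KeyError:
            result = Node._cache[key] = self._structural_equals(other)
            return result

    # Defining __eq__ disables the default hash, so restore identity hashing.
    __hash__ = object.__hash__
```

Because the structural comparison runs inside the except KeyError branch, a failure deeper in the tree is reported with one "During handling of the above exception" section per level, which is the shape of the traceback in the report. Returning the structural result directly, as the patch does, removes the cache lookups but not the recursion itself.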

@kszucs
Member

kszucs commented Sep 11, 2023

Another option would be to pickle the left and right arguments in case of a recursion error using the following snippet:

diff --git a/ibis/expr/operations/relations.py b/ibis/expr/operations/relations.py
index c444a7d88..a12ded3b0 100644
--- a/ibis/expr/operations/relations.py
+++ b/ibis/expr/operations/relations.py
@@ -220,6 +220,15 @@ class Join(TableNode):
             for pred in util.promote_list(predicates)
         ]

+        try:
+            left.equals(right)
+        except RecursionError:
+            import pickle
+            with open('left.pickle', 'wb') as fp:
+                pickle.dump(left, fp)
+            with open('right.pickle', 'wb') as fp:
+                pickle.dump(right, fp)
+
         if left.equals(right):
             # GH #667: If left and right table have a common parent expression,
             # e.g. they have different filters, we need to add a self-reference

Then I could try to load the two objects to reproduce the error.
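A minimal sketch of the loading side, assuming the patch above managed to write left.pickle and right.pickle. The load_pair helper is a hypothetical name introduced here; only pickle.load and the equals call come from the thread.

```python
import pickle


def load_pair(left_path="left.pickle", right_path="right.pickle"):
    """Load the two objects dumped by the patch (only unpickle trusted files)."""
    with open(left_path, "rb") as fp:
        left = pickle.load(fp)
    with open(right_path, "rb") as fp:
        right = pickle.load(fp)
    return left, right
```

With the files in place, left, right = load_pair() followed by left.equals(right) should hit the same code path as the failing join. Two caveats: the objects need to be unpickled with a compatible ibis version, and pickling a very deeply nested object can itself raise RecursionError, in which case the files would not be written at all.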

@cpcloud
Member

cpcloud commented Sep 12, 2023

@nehanene15 Can you show the value of join_keys?

@cpcloud
Member

cpcloud commented Sep 12, 2023

@kszucs The bug report says @nehanene15 is using 5.1.0, if that helps debug at all.

I am looking into it as well, to see if it may have been fixed already in master.

@cpcloud
Member

cpcloud commented Sep 12, 2023

I am not able to get the following test to fail on 5.1.0 or master:

import ibis
import pandas as pd


def test_large_join():
    source = pd.read_csv(
        "https://github.com/ibis-project/ibis/files/12580336/source_pivot.csv",
        index_col=0,
    )
    diffs = pd.read_csv(
        "https://github.com/ibis-project/ibis/files/12580340/differences_pivot.csv",
        index_col=0,
    )
    con = ibis.pandas.connect({"source": source, "diffs": diffs})
    source = con.tables.source
    diffs = con.tables.diffs

    join_keys = set(source.columns) & set(diffs.columns)
    join = source.join(diffs, join_keys, how="outer").select(
        [source[key] for key in join_keys]
        + [
            source["validation_type"],
            source["aggregation_type"],
            source["table_name"],
            source["column_name"],
            source["primary_keys"],
            source["num_random_rows"],
            source["agg_value"],
            diffs["difference"],
            diffs["pct_difference"],
            diffs["pct_threshold"],
            diffs["validation_status"],
        ],
    )
    df = join.execute()
    assert not df.empty

@nehanene15 Any ideas?

@nehanene15
Author

Hmm.. the join_keys value is ('validation_name',).

@kszucs When I try the patch, I get a similar cyclical error:

  File "/Users/nehanene/Projects/professional-services-data-validator/env/lib/python3.8/site-packages/ibis/common/grounds.py", line 203, in __cached_equals__
    return self.__equals__(other)
  File "/Users/nehanene/Projects/professional-services-data-validator/env/lib/python3.8/site-packages/ibis/common/grounds.py", line 241, in __equals__
    return self.__args__ == other.__args__
  File "/Users/nehanene/Projects/professional-services-data-validator/env/lib/python3.8/site-packages/ibis/common/grounds.py", line 187, in __eq__
    return self.__cached_equals__(other)
  File "/Users/nehanene/Projects/professional-services-data-validator/env/lib/python3.8/site-packages/ibis/common/grounds.py", line 203, in __cached_equals__
    return self.__equals__(other)
  File "/Users/nehanene/Projects/professional-services-data-validator/env/lib/python3.8/site-packages/ibis/common/grounds.py", line 241, in __equals__
    return self.__args__ == other.__args__
  File "/Users/nehanene/Projects/professional-services-data-validator/env/lib/python3.8/site-packages/ibis/common/grounds.py", line 187, in __eq__
    return self.__cached_equals__(other)
  File "/Users/nehanene/Projects/professional-services-data-validator/env/lib/python3.8/site-packages/ibis/common/grounds.py", line 203, in __cached_equals__
    return self.__equals__(other)
  File "/Users/nehanene/Projects/professional-services-data-validator/env/lib/python3.8/site-packages/ibis/common/grounds.py", line 241, in __equals__
    return self.__args__ == other.__args__
RecursionError: maximum recursion depth exceeded in comparison

@nehanene15
Author

When I print(source), I get a large Ibis expression like below which might be the issue. I might need to try executing the expression before doing the source.join()

r356 := Selection[r355]
  selections:
    validation_name:  r355.validation_name
    validation_type:  r355.validation_type
    aggregation_type: r355.aggregation_type
    table_name:       r355.table_name
    column_name:      r355.column_name
    primary_keys:     r355.primary_keys
    num_random_rows:  r355.num_random_rows
    agg_value:        r355.agg_value

r357 := Union[r356, r10, distinct=False]

r358 := Selection[r357]
  selections:
    validation_name:  r357.validation_name
    validation_type:  r357.validation_type
    aggregation_type: r357.aggregation_type
    table_name:       r357.table_name
    column_name:      r357.column_name
    primary_keys:     r357.primary_keys
    num_random_rows:  r357.num_random_rows
    agg_value:        r357.agg_value

r359 := Union[r358, r9, distinct=False]

r360 := Selection[r359]
  selections:
    validation_name:  r359.validation_name
    validation_type:  r359.validation_type
    aggregation_type: r359.aggregation_type
    table_name:       r359.table_name
    column_name:      r359.column_name
    primary_keys:     r359.primary_keys
    num_random_rows:  r359.num_random_rows
    agg_value:        r359.agg_value

r361 := Union[r360, r8, distinct=False]

r362 := Selection[r361]
  selections:
    validation_name:  r361.validation_name
    validation_type:  r361.validation_type
    aggregation_type: r361.aggregation_type
    table_name:       r361.table_name
    column_name:      r361.column_name
    primary_keys:     r361.primary_keys
    num_random_rows:  r361.num_random_rows
    agg_value:        r361.agg_value

r363 := Union[r362, r7, distinct=False]

r364 := Selection[r363]
  selections:
    validation_name:  r363.validation_name
    validation_type:  r363.validation_type
    aggregation_type: r363.aggregation_type
    table_name:       r363.table_name
    column_name:      r363.column_name
    primary_keys:     r363.primary_keys
    num_random_rows:  r363.num_random_rows
    agg_value:        r363.agg_value

r365 := Union[r364, r6, distinct=False]

r366 := Selection[r365]
  selections:
    validation_name:  r365.validation_name
    validation_type:  r365.validation_type
    aggregation_type: r365.aggregation_type
    table_name:       r365.table_name
    column_name:      r365.column_name
    primary_keys:     r365.primary_keys
    num_random_rows:  r365.num_random_rows
    agg_value:        r365.agg_value

r367 := Union[r366, r5, distinct=False]

r368 := Selection[r367]
  selections:
    validation_name:  r367.validation_name
    validation_type:  r367.validation_type
    aggregation_type: r367.aggregation_type
    table_name:       r367.table_name
    column_name:      r367.column_name
    primary_keys:     r367.primary_keys
    num_random_rows:  r367.num_random_rows
    agg_value:        r367.agg_value

r369 := Union[r368, r4, distinct=False]

Selection[r369]
  selections:
    validation_name:  r369.validation_name
    validation_type:  r369.validation_type
    aggregation_type: r369.aggregation_type
    table_name:       r369.table_name
    column_name:      r369.column_name
    primary_keys:     r369.primary_keys
    num_random_rows:  r369.num_random_rows
    agg_value:        r369.agg_value

@cpcloud
Member

cpcloud commented Sep 12, 2023

Ah, yeah it looks like there's around 370 tables in the mix there. There's nothing in principle preventing that, but it seems like it's related.

If you can pickle the unbound expression and dump that somewhere then we can probably reproduce it.

In the meantime, I will try to construct a big union of tables to see if I can reproduce this.
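The connection between expression depth and the error can be shown without ibis at all. The following is a minimal sketch with a hypothetical Node class whose equality compares its arguments recursively, in the same spirit as the __equals__ / __eq__ cycle in the traceback. The chain and equals_or_overflow helpers are illustrative names, not ibis APIs.

```python
import sys


class Node:
    """Toy expression node with recursive structural equality (hypothetical)."""

    def __init__(self, *args):
        self.args = args

    def __eq__(self, other):
        if type(self) is not type(other):
            return False
        # Comparing the tuples calls __eq__ on each nested Node.
        return self.args == other.args

    __hash__ = object.__hash__


def chain(depth):
    """Build a left-deep chain iteratively: Node(Node(Node(...), 'leaf'), 'leaf')."""
    node = Node("base")
    for _ in range(depth):
        node = Node(node, "leaf")
    return node


def equals_or_overflow(depth):
    """Compare two equal chains; report a RecursionError instead of raising."""
    left, right = chain(depth), chain(depth)
    try:
        return left == right
    except RecursionError:
        return "RecursionError"


def overflowing_depth():
    """A depth comfortably past the interpreter's current frame limit."""
    return sys.getrecursionlimit() * 2
```

Small chains compare fine, while equals_or_overflow(overflowing_depth()) reports the overflow. In the traceback above each nesting level appears to cost three Python frames (__cached_equals__, __equals__, __eq__), so with CPython's default limit of 1000 frames a chain a few hundred nodes deep would plausibly be enough. That is an estimate read off the traceback, not a measured figure.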

@cpcloud
Member

cpcloud commented Sep 12, 2023

OK, I can reproduce it with this:

import ibis
import pandas as pd


def test_large_join():
    source = pd.read_csv(
        "https://github.com/ibis-project/ibis/files/12580336/source_pivot.csv",
        index_col=0,
    )
    diffs = pd.read_csv(
        "https://github.com/ibis-project/ibis/files/12580340/differences_pivot.csv",
        index_col=0,
    )
    con = ibis.pandas.connect({"source": source, "diffs": diffs})
    n = 200
    source = ibis.union(*[con.tables.source for _ in range(n)])
    diffs = ibis.union(*[con.tables.diffs for _ in range(n)])

    join_keys = set(source.columns) & set(diffs.columns)
    join = source.join(diffs, join_keys, how="outer").select(
        [source[key] for key in join_keys]
        + [
            source["validation_type"],
            source["aggregation_type"],
            source["table_name"],
            source["column_name"],
            source["primary_keys"],
            source["num_random_rows"],
            source["agg_value"],
            diffs["difference"],
            diffs["pct_difference"],
            diffs["pct_threshold"],
            diffs["validation_status"],
        ],
    )
    df = join.execute()
    assert not df.empty

@nehanene15
Author

It works if I execute the large expr before doing the join!

In this case differences_pivot and source_pivot are the large expressions with around 495 tables in the mix.
Working code:

# Materialize the large expressions as DataFrames before joining.
differences_df = client.execute(differences_pivot)
source_df = client.execute(source_pivot)
# Assumption: target_pivot is materialized the same way.
target_df = client.execute(target_pivot)

# Register the DataFrames as fresh, shallow pandas-backend tables.
con = ibis.pandas.connect(
    {"source": source_df, "differences": differences_df, "target": target_df}
)
source = con.tables.source
differences = con.tables.differences

source_difference = source.join(differences, join_keys, how="outer")[
    [source[field] for field in join_keys]
    + [
        source["validation_type"],
        source["aggregation_type"],
        source["table_name"],
        source["column_name"],
        source["primary_keys"],
        source["num_random_rows"],
        source["agg_value"],
        differences["difference"],
        differences["pct_difference"],
        differences["pct_threshold"],
        differences["validation_status"],
    ]
]

@cpcloud
Member

cpcloud commented Sep 12, 2023

It's failing for the same reason in that test, but at a slightly different location (the execute call).

@nehanene15
Author

I see. Seems like it's best practice to execute the Ibis expr beforehand to avoid the 300+ table union/join so I'll update our code to reflect that if you agree.

@cpcloud
Member

cpcloud commented Sep 12, 2023

@kszucs I suspect we can construct a failing example without joins.

I suspect that there may be some properties we should turn into attributes in a few places, to avoid huge traversals, for example the schema attribute of Unions.

We can probably also avoid storing a huge tree for set operations.
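To illustrate the property-versus-attribute point, here is a minimal sketch assuming a union whose schema is simply its left input's schema. Table, UnionRecomputed, UnionStored, and build are hypothetical names for this example and do not reflect how ibis defines its nodes.

```python
import sys


class Table:
    def __init__(self, schema):
        self.schema = schema


class UnionRecomputed:
    """Schema is a property: every access walks all the way down to a Table."""

    def __init__(self, left, right):
        self.left, self.right = left, right

    @property
    def schema(self):
        return self.left.schema


class UnionStored:
    """Schema is resolved once at construction and stored as an attribute."""

    def __init__(self, left, right):
        self.left, self.right = left, right
        # One lookup on the direct child, which already stores its own schema.
        self.schema = left.schema


def build(union_cls, n):
    """Chain n unions iteratively on top of a single base table."""
    base = Table(("validation_name", "agg_value"))
    node = base
    for _ in range(n):
        node = union_cls(node, base)
    return node


def deep():
    """A chain length past the interpreter's current frame limit."""
    return sys.getrecursionlimit() * 2
```

Reading schema on build(UnionStored, deep()) is a single attribute lookup, while the same read on build(UnionRecomputed, deep()) recurses once per union and overflows the stack. Storing the value trades a little memory per node for a constant-depth lookup.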

@cpcloud
Member

cpcloud commented Sep 12, 2023

@nehanene15 I think for your case it's a viable workaround, but I don't think it's best practice 😄. I think it's a bug in ibis that we will try to address.

@kszucs
Member

kszucs commented Sep 12, 2023

I suspect that there may be some properties we should turn into attributes in a few places, to avoid huge traversals, for example the schema attribute of Unions.

I was thinking the same. I'm not sure how we could prevent call stacks like this, but we can certainly "delay" their occurrence.
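One general technique for keeping the call stack flat is to replace the recursive comparison with an explicit work stack. The sketch below shows that idea on a hypothetical Node class. It is an illustration of the technique only, and nothing in this thread indicates that ibis adopted it.

```python
class Node:
    """Toy expression node (hypothetical); uses default identity equality."""

    def __init__(self, *args):
        self.args = args


def iter_equals(a, b):
    """Structural equality using an explicit stack instead of recursion."""
    stack = [(a, b)]
    while stack:
        x, y = stack.pop()
        if x is y:
            continue
        x_is_node, y_is_node = isinstance(x, Node), isinstance(y, Node)
        if x_is_node and y_is_node:
            if type(x) is not type(y) or len(x.args) != len(y.args):
                return False
            # Defer the children rather than recursing into them.
            stack.extend(zip(x.args, y.args))
        elif x_is_node or y_is_node:
            return False
        elif x != y:
            return False
    return True


def chain(depth, leaf="leaf"):
    """Build a left-deep chain of nodes iteratively."""
    node = Node("base")
    for _ in range(depth):
        node = Node(node, leaf)
    return node
```

Here iter_equals(chain(5000), chain(5000)) completes regardless of the recursion limit, because the pending comparisons live in a Python list rather than on the interpreter's call stack. Shared subtrees may be compared more than once in this sketch; a real implementation would likely want to memoize visited pairs.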

@cpcloud
Member

cpcloud commented Sep 12, 2023

Another thing that may help decrease call stack size is changing the representation of SetOp to be variadic.
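A small sketch of why a variadic representation helps, assuming a binary union is built by folding tables pairwise. BinaryUnion, VariadicUnion, binary_union, and depth are hypothetical names for illustration and are not the ibis SetOp classes.

```python
class BinaryUnion:
    def __init__(self, left, right):
        self.left, self.right = left, right


class VariadicUnion:
    def __init__(self, *tables):
        self.tables = tables


def binary_union(tables):
    """Fold tables pairwise: Union(Union(Union(t0, t1), t2), t3) ..."""
    node = tables[0]
    for table in tables[1:]:
        node = BinaryUnion(node, table)
    return node


def depth(node):
    """Maximum nesting depth, computed with an explicit stack."""
    best, stack = 0, [(node, 1)]
    while stack:
        current, level = stack.pop()
        best = max(best, level)
        if isinstance(current, BinaryUnion):
            stack.append((current.left, level + 1))
            stack.append((current.right, level + 1))
        elif isinstance(current, VariadicUnion):
            stack.extend((table, level + 1) for table in current.tables)
    return best


# Plain strings stand in for leaf tables.
tables = [f"t{i}" for i in range(200)]
binary_depth = depth(binary_union(tables))
variadic_depth = depth(VariadicUnion(*tables))
```

With 200 inputs the pairwise fold produces a tree 200 levels deep, while the variadic node stays at depth 2 however many tables it holds, so any recursive traversal of it needs far fewer frames.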

@cpcloud
Member

cpcloud commented Sep 13, 2023

@nehanene15 Can you try your code against #7148? That should give you some breathing room for huge unions, though see the PR description (points 3 and 4), which might explain any new issues that look similar 😅

@nehanene15
Author

@cpcloud This definitely gives more wiggle room in addition to executing the ibis expr before the joins.

@cpcloud
Member

cpcloud commented Sep 15, 2023

@nehanene15 You should have plenty of room for those big unions now :)

If anything else pops up don't hesitate to open another issue!
