-
Notifications
You must be signed in to change notification settings - Fork 608
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bug: maximum recursion depth with join operation #7124
Comments
Thanks for the report! Can you show the code that produces the exception? It'll be easier to write a regression test if we can get the exact case that raises the exception. |
This is the line that generates the exception: https://github.com/GoogleCloudPlatform/professional-services-data-validator/blob/47154b4139bf22358359c53fe25bbce44745589f/data_validation/combiner.py#L321 And this is where Pandas tables are instantiated before it gets to the combiner.py: https://github.com/GoogleCloudPlatform/professional-services-data-validator/blob/47154b4139bf22358359c53fe25bbce44745589f/data_validation/data_validation.py#L339 For context, the source and target are SQL query results and the combiner.generate_report() aims to find the differences between the two results, if any, for data validation. |
Any chance you could dump a parquet file of the left and right tables somewhere? |
GitHub won't allow me to upload Parquet, but I'll upload the CSVs of the source and differences. We're running |
Hi @nehanene15! Could you please try out with the following patch applied (I assume you are using ibis 6.2): diff --git a/ibis/common/grounds.py b/ibis/common/grounds.py
index 394bc4ccf..6f5bcd184 100644
--- a/ibis/common/grounds.py
+++ b/ibis/common/grounds.py
@@ -203,6 +203,8 @@ class Comparable(Base):
if type(self) is not type(other):
return False
+ return self.__equals__(other)
+
# reduce space required for commutative operation
if id(self) < id(other):
key = (self, other) This way we turn of an optimization which maintains a global cache for operation node equality checks. If it keeps failing we could get a clearer traceback. |
Another option would be to pickle the diff --git a/ibis/expr/operations/relations.py b/ibis/expr/operations/relations.py
index c444a7d88..a12ded3b0 100644
--- a/ibis/expr/operations/relations.py
+++ b/ibis/expr/operations/relations.py
@@ -220,6 +220,15 @@ class Join(TableNode):
for pred in util.promote_list(predicates)
]
+ try:
+ left.equals(right)
+ except RecursionError:
+ import pickle
+ with open('left.pickle', 'wb') as fp:
+ pickle.dump(left, fp)
+ with open('right.pickle', 'wb') as fp:
+ pickle.dump(right, fp)
+
if left.equals(right):
# GH #667: If left and right table have a common parent expression,
# e.g. they have different filters, we need to add a self-reference Then I could try to load the two objects to reproduce the error. |
@nehanene15 Can you show the value of |
@kszucs The bug report says @nehanene15 is using 5.1.0, if that helps debug at all. I am looking into it as well, to see if it may have been fixed already in |
I am not able to get the following test to fail on 5.1.0 or def test_large_join():
source = pd.read_csv(
"https://github.com/ibis-project/ibis/files/12580336/source_pivot.csv",
index_col=0,
)
diffs = pd.read_csv(
"https://github.com/ibis-project/ibis/files/12580340/differences_pivot.csv",
index_col=0,
)
con = ibis.pandas.connect({"source": source, "diffs": diffs})
source = con.tables.source
diffs = con.tables.diffs
join_keys = set(source.columns) & set(diffs.columns)
join = source.join(diffs, join_keys, how="outer").select(
[source[key] for key in join_keys]
+ [
source["validation_type"],
source["aggregation_type"],
source["table_name"],
source["column_name"],
source["primary_keys"],
source["num_random_rows"],
source["agg_value"],
diffs["difference"],
diffs["pct_difference"],
diffs["pct_threshold"],
diffs["validation_status"],
],
)
df = join.execute()
assert not df.empty @nehanene15 Any ideas? |
Hmm.. the join_keys value is @kszucs When I try the patch, I get a similar cyclical error:
|
When I
|
Ah, yeah it looks like there's around 370 tables in the mix there. There's nothing in principle preventing that, but it seems like it's related. If you can pickle the unbound expression and dump that somewhere then we can probably reproduce it. In the meantime, I will try to construct a big union of tables to see if I can reproduce this. |
Ok, I can reproduce it with this def test_large_join():
source = pd.read_csv(
"https://github.com/ibis-project/ibis/files/12580336/source_pivot.csv",
index_col=0,
)
diffs = pd.read_csv(
"https://github.com/ibis-project/ibis/files/12580340/differences_pivot.csv",
index_col=0,
)
con = ibis.pandas.connect({"source": source, "diffs": diffs})
n = 200
source = ibis.union(*[con.tables.source for _ in range(n)])
diffs = ibis.union(*[con.tables.diffs for _ in range(n)])
join_keys = set(source.columns) & set(diffs.columns)
join = source.join(diffs, join_keys, how="outer").select(
[source[key] for key in join_keys]
+ [
source["validation_type"],
source["aggregation_type"],
source["table_name"],
source["column_name"],
source["primary_keys"],
source["num_random_rows"],
source["agg_value"],
diffs["difference"],
diffs["pct_difference"],
diffs["pct_threshold"],
diffs["validation_status"],
],
)
df = join.execute()
assert not df.empty |
It works if I execute the large expr before doing the join! In this case differences_pivot and source_pivot are the large expressions with around 495 tables in the mix.
|
It's failing for the same reason in that test, but at a slightly different location (the |
I see. Seems like it's best practice to execute the Ibis expr beforehand to avoid the 300+ table union/join so I'll update our code to reflect that if you agree. |
@kszucs I suspect we can construct a failing example without joins. I suspect that there may be some We can probably also avoid storing a huge tree for set operations |
@nehanene15 I think for your case it's a viable workaround, but I don't think it's best practice 😄, I think it's a bug in ibis that we will try to address. |
I was thinking of the same, I'm not sure how could we prevent call stacks like this, but we can certainly "delay" their occurrence. |
Another thing that may help decrease call stack size is changing the representation of SetOp to be variadic. |
@nehanene15 Can you try your code against #7148? That should give you some breathing room for huge unions, though see the PR description (points 3 and 4) that might explain any new issues that look similar 😅 |
@cpcloud This definitely gives more wiggle room in addition to executing the ibis expr before the joins. |
@nehanene15 You should have plenty of room for those big unions now :) If anything else pops up don't hesitate to open another issue! |
What happened?
We're seeing a
RecursionError: maximum recursion depth exceeded while calling a Python object
when running a JOIN:source_difference = source.join(differences, join_keys, how="outer")
Both 'source' and 'differences' are pandas.Table()s with many columns (~120).
We don't hit this error with smaller, less wide tables. I've provided a abridged version of the stack trace below - it does look like there is a cyclical portion of the code when testing if left and right tables have a common parent expr here.
Trying to understand if this is a Python limitation due to how wide the table is, or an Ibis bug. Appreciate the help!
What version of ibis are you using?
5.1.0
What backend(s) are you using, if any?
Pandas
Relevant log output
Code of Conduct
The text was updated successfully, but these errors were encountered: