-
Notifications
You must be signed in to change notification settings - Fork 608
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bug: inlining expressions leads to wrong results for non-pure functions #8921
Comments
We can add these impure types to https://github.com/ibis-project/ibis/blob/main/ibis/backends/sql/rewrites.py#L155 in order to prevent merging those selects. |
That would work, if we are sure we can get all the types. What about ops.ArrayDistinct though? In duckdb that is an impure function, because the order of the result is not guaranteed. But in other backends it might be a pure function. Do we just have to be conservative and call that a blocking op? |
What do you think is the counter-argument against only merging on re-selects? Is it just the uglier/more verbose SQL? Unless we have evidence that it is slower, I would want to lean towards the less-likely-to-be-incorrect route, even if the generated SQL is uglier for human consumption. |
The problem doesn't just occur in case of merging In [1]: import ibis
In [2]: uid = ibis.uuid()
In [3]: t = ibis.table({"a": "int", "b": "string"})
In [4]: t1 = t.select(t, uid)
In [7]: t1 = t.select(t, uid=uid)
In [8]: t2 = t1.select(uid)
In [9]: t2
Out[9]:
r0 := UnboundTable: unbound_table_0
a int64
b string
r1 := Project[r0]
a: r0.a
b: r0.b
uid: RandomUUID()
Project[r1]
RandomUUID(): r1.uid
In [10]: t3 = t1.select(ibis.uuid())
In [11]: t3
Out[11]:
r0 := UnboundTable: unbound_table_0
a int64
b string
r1 := Project[r0]
a: r0.a
b: r0.b
uid: RandomUUID()
Project[r1]
RandomUUID(): r1.uid This "rewrite" happens at the API level as well during value dereferencing which we certainly would like to keep. In the first case we use the same variable so we mean the same value whereas in the latter we create a new uuid assuming that the user intention is to create a new one. I think this issue can be resolved by better modelling of inpure expressions like class Unique(Value):
_counter = itertools.count()
uid: Optional[int] = None
def __init__(self, uid, **kwargs):
if uid is None:
uid = next(self._counter)
super().__init__(uid=uid, **kwargs)
class RandomUUID(Unique):
dtype = dt.uuid
class RandomScalar(Unique):
dtype = dt.float64 |
What's the issue with disabling the merging? Otherwise we'll be playing whack-a-mole with discovering impure functions. |
Disabling merging would solve the original problem in the issue but wouldn't solve the dereferencing one shown by me. My proposed solution would solve both. |
Sorry I went on vacation for a few days, therefore radio silence. I'm hesitant that the linked PR wasn't the right solution:
If I wanted y and z to be correlated, I would have referenced the parent column directly: impure = func()
t = ibis.table({"x": "int64"}, name="t")
t1 = t.mutate(y=impure)
t2 = t1.mutate(z=t1.y.cast("string"))
expected = ops.Project(
parent=t1,
values={"x": t1.x, "y": t1.y, "z": t1.y.cast("string")},
) If I don't reference the parent table in subsequent selects, then I expect the columns to be uncorrelated: impure = func()
t = ibis.table({"x": "int64"}, name="t")
t1 = t.mutate(y=impure)
t2 = t1.mutate(z=impure.cast("string"))
expected = ops.Project(
parent=t1,
values={"x": t1.x, "y": t1.y, "z": impure.cast("string")},
) I think we need to rely on the user doing explicit .selects and .mutates to be the boundary locations where expressions are separated as correlated/uncorrelated. |
I see what you're saying here, that does look like a problem. Just to be sure here, what I would expect is for both cases to be something like the very simple
(ie there is literally NO reference to the parent tables, because we only selected a single UUID expression, which doesn't depend on any parents). Is this what you would expect too? |
Well, yeah :) See my comment here: #8967 (comment) |
In this comment, @kszucs says: # semantically this should produce two different columns
t.select(x=ibis.random(), y=ibis.random())
# semantically this should produce identical columns
rand = ibis.random()
t.select(x=rand, y=rand) I think that both of these should actually produce two different columns. If you want to get two identical columns, I think you should have to explicitly re-select from a parent relation: t.select(common=ibis.random()).select(x=_.common, y=_.common) Consider this example (maybe I'm not understanding the implementation, correct me if so): # @functools.cache()
def generate_noise(amplitude):
return (ibis.random() - .5) * amplitude
def add_noise(t, columns: list[str]):
return t.mutate(**{col:t[col] + generate_noise(100) for col in columns}) If there is a |
Need to fix the UDF test case. Related to ibis-project#8921, trying to write down exactly what the expected behavior is.
Need to fix the UDF test case. Related to ibis-project#8921, trying to write down exactly what the expected behavior is.
@kszucs Perhaps we should consider restricting impurity further, to effectively require a projection to get the same value. I agree with Nick that A variable isn't the same as a column in a projection. |
I can accept that as the desired behaviour, though we need to prevent dereferencing in that case. Going to take a look. |
Need to fix the UDF test case. Related to ibis-project#8921, trying to write down exactly what the expected behavior is.
This comment is maybe getting towards the use cases we need to support? |
Need to fix the UDF test case. Related to ibis-project#8921, trying to write down exactly what the expected behavior is.
Need to fix the UDF test case. Related to ibis-project#8921, trying to write down exactly what the expected behavior is.
Need to fix the a few broken cases. Related to ibis-project#8921, trying to write down exactly what the expected behavior is.
Need to fix the a few broken cases. Related to ibis-project#8921, trying to write down exactly what the expected behavior is.
Need to fix the a few broken cases. Related to ibis-project#8921, trying to write down exactly what the expected behavior is.
Need to fix the a few broken cases. Related to ibis-project#8921, trying to write down exactly what the expected behavior is.
Need to fix the a few broken cases. Related to ibis-project#8921, trying to write down exactly what the expected behavior is.
Need to fix the a few broken cases. Related to ibis-project#8921, trying to write down exactly what the expected behavior is.
Related to ibis-project#8921, trying to write down exactly what the expected behavior is.
Related to ibis-project#8921, trying to write down exactly what the expected behavior is.
Related to ibis-project#8921, trying to write down exactly what the expected behavior is.
Related to ibis-project#8921, trying to write down exactly what the expected behavior is.
Related to ibis-project#8921, trying to write down exactly what the expected behavior is.
Related to ibis-project#8921, trying to write down exactly what the expected behavior is.
Related to ibis-project#8921, trying to write down exactly what the expected behavior is.
Related to ibis-project#8921, trying to write down exactly what the expected behavior is.
What happened?
See this:
Notice that the
size
column is sometimes incorrect. This is because size is getting calculated from a totally uncorrelatedRANDOM()
in the generated SQL.ibis.to_sql(t)
shows:For this to work as expected, there can only be a single
RANDOM()
in the generated SQL. I think that the agressive inlining that we do to merge selects/mutates together is over-aggressive. After a short talk with @cpcloud, this inlining is only correct when the entire expression is pure. Either we need to somehow know when an expression is pure (most are, except for (some) UDFs, random(), maybe now(), and maybe others I'm not thinking of?), or just be conservative and only do this inlining when the subsequent selects/mutates are just reselecting/renaming columns with no other changes.What version of ibis are you using?
main
What backend(s) are you using, if any?
I'm not sure if this is present on all backends, but I think so.
Relevant log output
No response
Code of Conduct
The text was updated successfully, but these errors were encountered: