How will cudf handle df.assign(df.col('a').sort())? #344
Comments
Apologies for missing this, Marco. Yeah, great observation: I expect that recent additions to the standard, such as lazy semantics and column expressions, will require cuDF to also have a separate namespace to support the standard.
We don't have column expressions in the Standard. But now that you bring it up... let's try this again: #346
Since we disallowed operations across columns from different dataframes and I imagine the sorting operation would invalidate the parent dataframe, wouldn't doing something like:
be disallowed? If that's the case, why would we allow assigning it to a DataFrame without using a join explicitly?
Incorrect. It's only the column that's being sorted, not the whole dataframe.
But logically wouldn't we be adding a column with a defined order to a dataframe with arbitrary / undefined order and we disallowed cross dataframe column operations to avoid this specific situation? There's no way to express this in SQL without a join on a row number for example. Trying to do this in Ibis errors with the DuckDB backend and the PySpark backend, returns incorrect results with the Pandas backend, and works with the Polars backend:
Trying to do this with PySpark's DataFrame API directly errors that it can't generate code for the expression:
It seems like only Polars supports this in the way you're describing?
Could you please give an example of
Anyway, the amount of confusion here underscores just how complicated, difficult to teach, and unusable the current API is. Rather than trying to shoehorn about 4 different concepts into just two classes, there is a simpler way forwards: #346. Then, instead of all these "parent dataframe" complications, we can just say "index alignment is unsupported - join dataframes beforehand".
Going back to #347 (comment) - if you know that
I would argue it is, yes. But so is using any other arbitrary column that doesn't necessarily need to originate from Implementations don't implicitly support this though, where it would require doing something like adding an explicit join on something like a virtual row number.
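To make the "explicit join on a virtual row number" idea concrete, here is a minimal sketch in plain pandas. It is not taken from the thread and does not use the Standard's API. The `row_number` key and the `a_sorted` column name are hypothetical, chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 1, 2], "b": [10, 20, 30]})

# Materialise the current row order of the dataframe as an explicit key
# (hypothetical column name: row_number).
left = df.reset_index(drop=True).rename_axis("row_number").reset_index()

# Sort the column on its own, then number its rows 0..n-1 in sorted order.
right = (
    df[["a"]]
    .sort_values("a")
    .reset_index(drop=True)
    .rename(columns={"a": "a_sorted"})
    .rename_axis("row_number")
    .reset_index()
)

# The positional pairing is now an explicit join rather than implicit alignment.
joined = left.merge(right, on="row_number", how="inner").drop(columns="row_number")
print(joined)
```

The point of the sketch is that the pairing of "row i of the dataframe" with "row i of the sorted column" has to be spelled out as a key, which is roughly what a SQL backend would need in order to express the same operation.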
It's been brought up that cudf won't implement a separate namespace for the standard, but rather that their main namespace will be compliant.
Just for my understanding, what do you intend to do about df.assign(df.col('a').sort())?
pandas behaviour: do nothing, because it aligns on the index
Standard behaviour: insert the sorted values:
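The original snippets did not survive in this copy of the thread. As a rough illustration only, the two behaviours can be reproduced in plain pandas. The example data is made up, and `to_numpy()` is used here just to emulate positional insertion; it is not the Standard's API.

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 1, 2], "b": [10, 20, 30]})

# sort_values keeps each value attached to its original index label.
sorted_a = df["a"].sort_values()

# pandas aligns on the index when assigning, so every value goes back
# to the row it came from and the dataframe is unchanged.
aligned = df.assign(a=sorted_a)

# Dropping the index (here via to_numpy) inserts the values by position,
# which is the "insert the sorted values" behaviour described above.
positional = df.assign(a=sorted_a.to_numpy())

print(aligned)
print(positional)
```

In the first result column `a` keeps its original order, while in the second it holds the sorted values next to an untouched column `b`.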
@shwina @kkraus14