Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How will cudf handle df.assign(df.col('a').sort())? #344

Closed
MarcoGorelli opened this issue Jan 12, 2024 · 10 comments
Closed

How will cudf handle df.assign(df.col('a').sort())? #344

MarcoGorelli opened this issue Jan 12, 2024 · 10 comments
Labels
question Further information is requested

Comments

@MarcoGorelli
Copy link
Contributor

It's been brought up that cudf won't implement a separate namespace with the standard, but rather that their main namespace will be compliant

Just for my understanding, what do you intend to do about df.assign(df.col('a').sort())?

pandas behaviour: do nothing, because it aligns on the index

In [4]: df
Out[4]:
   a
0  4
1  1
2  2

In [5]: df.assign(b=lambda x: x['a'].sort_values())
Out[5]:
   a  b
0  4  4
1  1  1
2  2  2

Standard behaviour: insert the sorted values:

In [1]: df = pd.DataFrame({'a': [4, 1, 2]})

In [2]: dfs = df.__dataframe_consortium_standard__()

In [3]: dfs.assign(dfs.col('a').sort().rename('b')).dataframe
Out[3]:
   a  b
0  4  1
1  1  2
2  2  4

@shwina @kkraus14

@MarcoGorelli
Copy link
Contributor Author

@shwina @kkraus14 gentle ping, this is quite important

@MarcoGorelli MarcoGorelli added the question Further information is requested label Jan 23, 2024
@shwina
Copy link
Contributor

shwina commented Jan 23, 2024

Apologies for missing this Marco. Yeah great observation: I expect that recent additions to the standard such as lazy semantics and column expressions are going to need cuDF also to have a separate namespace to support the standard.

@MarcoGorelli
Copy link
Contributor Author

We don't have column expressions in the Standard

But now that you bring it up...let's try this again #346

@kkraus14
Copy link
Collaborator

Since we disallowed operations across columns from different dataframes and I imagine the sorting operation would invalidate the parent dataframe, wouldn't doing something like:

dfs.col('a') + (dfs.col('a').sort())

be disallowed? If that's the case why would we allow assigning it to a DataFrame without using a join explicitly?

@MarcoGorelli
Copy link
Contributor Author

I imagine the sorting operation would invalidate the parent dataframe

Incorrect. It's only the column that's being sorted, not the whole dataframe. dfs.col('a') and dfs.col('a').sort() have the same parent dataframe.

@kkraus14
Copy link
Collaborator

But logically wouldn't we be adding a column with a defined order to a dataframe with arbitrary / undefined order and we disallowed cross dataframe column operations to avoid this specific situation?

There's no way to express this in SQL without a join on a row number for example.

Trying to do this in Ibis errors with the DuckDB backend and the PySpark backend, returns incorrect results with the Pandas backend, and works with the Polars backend:

import ibis
from ibis import _

con = ibis.${BACKEND}.connect()  #duckdb, pyspark, pandas, polars, etc.
con.create_table("table", ibis.memtable({"a": [1, 2, 3], "b": [4, 5, 6]})
table = con.table("table")

table.mutate(c = _.a.desc()).execute()

Trying to do this with PySpark's DataFrame API directly errors that it can't generate code for the expression:

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()
table = spark.createDataFrame([
    Row(a = 1, b = 4),
    Row(a = 2, b = 5),
    Row(a = 3, b = 6)
])

table.withColumn("c", table["a"].desc()).collect()

It seems like only Polars supports this in the way you're describing?

@MarcoGorelli
Copy link
Contributor Author

Could you please give an example of Column.sort which you think should be supported

@MarcoGorelli
Copy link
Contributor Author

Anyway, the amount of confusion here underscores just how complicated, difficult to teach, and unusable the current API is

Rather than trying to shoehorn about 4 different concepts into just two classes, there is a simpler way forwards: #346

Then, instead of all these "parent dataframe" complications, we can just say "index alignment is unsupported - join dataframes before hand".

@MarcoGorelli
Copy link
Contributor Author

Going back to #347 (comment) - if you know that df has a defined order, then df.assign(df.col('a').sort().rename('b')) is well-defined, right?

@kkraus14
Copy link
Collaborator

kkraus14 commented Jan 25, 2024

Going back to #347 (comment) - if you know that df has a defined order, then df.assign(df.col('a').sort().rename('b')) is well-defined, right?

I would argue it is, yes. But so is using any other arbitrary column that doesn't necessarily need to originate from df that has a defined order.

Implementations don't implicitly support this though, where it would require doing something like adding an explicit join on something like a virtual row number.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

3 participants