How will cudf handle `df.assign(df.col('a').sort())`? #344

MarcoGorelli · 2024-01-12T10:14:48Z

It's been brought up that cudf won't implement a separate namespace with the standard, but rather that their main namespace will be compliant

Just for my understanding, what do you intend to do about df.assign(df.col('a').sort())?

pandas behaviour: do nothing, because it aligns on the index

In [4]: df
Out[4]:
   a
0  4
1  1
2  2

In [5]: df.assign(b=lambda x: x['a'].sort_values())
Out[5]:
   a  b
0  4  4
1  1  1
2  2  2

Standard behaviour: insert the sorted values:

In [1]: df = pd.DataFrame({'a': [4, 1, 2]})

In [2]: dfs = df.__dataframe_consortium_standard__()

In [3]: dfs.assign(dfs.col('a').sort().rename('b')).dataframe
Out[3]:
   a  b
0  4  1
1  1  2
2  2  4

@shwina @kkraus14

The text was updated successfully, but these errors were encountered:

MarcoGorelli · 2024-01-23T08:38:04Z

@shwina @kkraus14 gentle ping, this is quite important

shwina · 2024-01-23T10:10:31Z

Apologies for missing this Marco. Yeah great observation: I expect that recent additions to the standard such as lazy semantics and column expressions are going to need cuDF also to have a separate namespace to support the standard.

MarcoGorelli · 2024-01-23T11:29:51Z

We don't have column expressions in the Standard

But now that you bring it up...let's try this again #346

kkraus14 · 2024-01-23T18:57:31Z

Since we disallowed operations across columns from different dataframes and I imagine the sorting operation would invalidate the parent dataframe, wouldn't doing something like:

dfs.col('a') + (dfs.col('a').sort())

be disallowed? If that's the case why would we allow assigning it to a DataFrame without using a join explicitly?

MarcoGorelli · 2024-01-23T19:21:02Z

I imagine the sorting operation would invalidate the parent dataframe

Incorrect. It's only the column that's being sorted, not the whole dataframe. dfs.col('a') and dfs.col('a').sort() have the same parent dataframe.

kkraus14 · 2024-01-23T20:16:35Z

But logically wouldn't we be adding a column with a defined order to a dataframe with arbitrary / undefined order and we disallowed cross dataframe column operations to avoid this specific situation?

There's no way to express this in SQL without a join on a row number for example.

Trying to do this in Ibis errors with the DuckDB backend and the PySpark backend, returns incorrect results with the Pandas backend, and works with the Polars backend:

import ibis
from ibis import _

con = ibis.${BACKEND}.connect()  #duckdb, pyspark, pandas, polars, etc.
con.create_table("table", ibis.memtable({"a": [1, 2, 3], "b": [4, 5, 6]})
table = con.table("table")

table.mutate(c = _.a.desc()).execute()

Trying to do this with PySpark's DataFrame API directly errors that it can't generate code for the expression:

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()
table = spark.createDataFrame([
    Row(a = 1, b = 4),
    Row(a = 2, b = 5),
    Row(a = 3, b = 6)
])

table.withColumn("c", table["a"].desc()).collect()

It seems like only Polars supports this in the way you're describing?

MarcoGorelli · 2024-01-23T21:57:18Z

Could you please give an example of Column.sort which you think should be supported

MarcoGorelli · 2024-01-24T14:26:57Z

Anyway, the amount of confusion here underscores just how complicated, difficult to teach, and unusable the current API is

Rather than trying to shoehorn about 4 different concepts into just two classes, there is a simpler way forwards: #346

Then, instead of all these "parent dataframe" complications, we can just say "index alignment is unsupported - join dataframes before hand".

MarcoGorelli · 2024-01-25T18:51:39Z

Going back to #347 (comment) - if you know that df has a defined order, then df.assign(df.col('a').sort().rename('b')) is well-defined, right?

kkraus14 · 2024-01-25T19:03:41Z

Going back to #347 (comment) - if you know that df has a defined order, then df.assign(df.col('a').sort().rename('b')) is well-defined, right?

I would argue it is, yes. But so is using any other arbitrary column that doesn't necessarily need to originate from df that has a defined order.

Implementations don't implicitly support this though, where it would require doing something like adding an explicit join on something like a virtual row number.

MarcoGorelli added the question Further information is requested label Jan 23, 2024

MarcoGorelli mentioned this issue Jan 25, 2024

Remove DataFrame.take #347

Closed

MarcoGorelli closed this as completed Mar 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How will cudf handle `df.assign(df.col('a').sort())`? #344

How will cudf handle `df.assign(df.col('a').sort())`? #344

MarcoGorelli commented Jan 12, 2024

MarcoGorelli commented Jan 23, 2024

shwina commented Jan 23, 2024

MarcoGorelli commented Jan 23, 2024

kkraus14 commented Jan 23, 2024

MarcoGorelli commented Jan 23, 2024

kkraus14 commented Jan 23, 2024

MarcoGorelli commented Jan 23, 2024

MarcoGorelli commented Jan 24, 2024

MarcoGorelli commented Jan 25, 2024

kkraus14 commented Jan 25, 2024 •

edited

Loading

How will cudf handle df.assign(df.col('a').sort())? #344

How will cudf handle df.assign(df.col('a').sort())? #344

Comments

MarcoGorelli commented Jan 12, 2024

MarcoGorelli commented Jan 23, 2024

shwina commented Jan 23, 2024

MarcoGorelli commented Jan 23, 2024

kkraus14 commented Jan 23, 2024

MarcoGorelli commented Jan 23, 2024

kkraus14 commented Jan 23, 2024

MarcoGorelli commented Jan 23, 2024

MarcoGorelli commented Jan 24, 2024

MarcoGorelli commented Jan 25, 2024

kkraus14 commented Jan 25, 2024 • edited Loading

How will cudf handle `df.assign(df.col('a').sort())`? #344

How will cudf handle `df.assign(df.col('a').sort())`? #344

kkraus14 commented Jan 25, 2024 •

edited

Loading