Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

map_groups no longer works when passing a list of column names to group_by #14401

Open
2 tasks done
kmcentush opened this issue Feb 10, 2024 · 5 comments
Open
2 tasks done
Labels
bug Something isn't working P-medium Priority: medium python Related to Python Polars regression Issue introduced by a new release

Comments

@kmcentush
Copy link

kmcentush commented Feb 10, 2024

Checks

Reproducible example

import polars as pl

df = pl.from_dict({
    "color": ["red", "red", "red", "blue", "blue"],
    "shape": ["circle", "circle", "square", "square", "square"],
    "value": [1, 2, 3, 4, 5],
})
print(df)


def custom(df: pl.DataFrame) -> pl.DataFrame:
    complex_udf_logic = df["value"].sum()
    return df.with_columns(sum=complex_udf_logic)

def custom2(df: pl.DataFrame) -> pl.DataFrame:
    complex_udf_logic = df["value"].sum()
    return df[0].with_columns(sum=complex_udf_logic)


# Adds value of group to each row; number of rows unchanged
agg = df.group_by(["color", "shape"]).map_groups(custom)
print(agg)

# Returns one row per group
agg2 = df.group_by(["color", "shape"]).map_groups(custom2)
print(agg2)

Log output

shape: (5, 3)
┌───────┬────────┬───────┐
│ color ┆ shape  ┆ value │
│ ---   ┆ ---    ┆ ---   │
│ str   ┆ str    ┆ i64   │
╞═══════╪════════╪═══════╡
│ red   ┆ circle ┆ 1     │
│ red   ┆ circle ┆ 2     │
│ red   ┆ square ┆ 3     │
│ blue  ┆ square ┆ 4     │
│ blue  ┆ square ┆ 5     │
└───────┴────────┴───────┘
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[3], line 21
     17     return df[0].with_columns(sum=complex_udf_logic)
     20 # Adds value of group to each row; number of rows unchanged
---> 21 agg = df.group_by(["color", "shape"]).map_groups(custom)
     22 print(agg)
     24 # Returns one row per group

File ~/mambaforge/envs/cuberg-notebooks/lib/python3.10/site-packages/polars/dataframe/group_by.py:319, in GroupBy.map_groups(self, function)
    317 if not all(isinstance(c, str) for c in self.by):
    318     msg = "cannot call `map_groups` when grouping by an expression"
--> 319     raise TypeError(msg)
    321 return self.df.__class__._from_pydf(
    322     self.df._df.group_by_map_groups(
    323         list(self.by), function, self.maintain_order
    324     )
    325 )

TypeError: cannot call `map_groups` when grouping by an expression

Issue description

Between polars 0.20.6 and 0.20.7, some of my complex UDF maps broke. I was able to recreate a simple example above where the code works in the earlier version but not latest.

Is this breaking change intentional? I didn't see it highlighted in the release notes, but I could be wrong. I believe it's possible for me to refactor my code to use the more recent documentation's suggestion of .agg(pl.map_groups()) setup, but it will take a decent amount of work given the first argument/dtype coming in as a list of series instead of a dataframe.

Aside: I recognize that these demonstrated aggregations do not actually require a UDF; rest-assured that my actual use case is using third-party Python libraries (scipy, etc.) that do optimization and return result(s) - and therefore require a UDF.

Expected behavior

Same exact code above.

Log output from polars 0.20.6:

shape: (5, 3)
┌───────┬────────┬───────┐
│ color ┆ shape  ┆ value │
│ ---   ┆ ---    ┆ ---   │
│ str   ┆ str    ┆ i64   │
╞═══════╪════════╪═══════╡
│ red   ┆ circle ┆ 1     │
│ red   ┆ circle ┆ 2     │
│ red   ┆ square ┆ 3     │
│ blue  ┆ square ┆ 4     │
│ blue  ┆ square ┆ 5     │
└───────┴────────┴───────┘
shape: (5, 4)
┌───────┬────────┬───────┬─────┐
│ color ┆ shape  ┆ value ┆ sum │
│ ---   ┆ ---    ┆ ---   ┆ --- │
│ str   ┆ str    ┆ i64   ┆ i32 │
╞═══════╪════════╪═══════╪═════╡
│ red   ┆ circle ┆ 1     ┆ 3   │
│ red   ┆ circle ┆ 2     ┆ 3   │
│ red   ┆ square ┆ 3     ┆ 3   │
│ blue  ┆ square ┆ 4     ┆ 9   │
│ blue  ┆ square ┆ 5     ┆ 9   │
└───────┴────────┴───────┴─────┘
shape: (3, 4)
┌───────┬────────┬───────┬─────┐
│ color ┆ shape  ┆ value ┆ sum │
│ ---   ┆ ---    ┆ ---   ┆ --- │
│ str   ┆ str    ┆ i64   ┆ i32 │
╞═══════╪════════╪═══════╪═════╡
│ blue  ┆ square ┆ 4     ┆ 9   │
│ red   ┆ square ┆ 3     ┆ 3   │
│ red   ┆ circle ┆ 1     ┆ 3   │
└───────┴────────┴───────┴─────┘

Installed versions

--------Version info---------
Polars:               0.20.7
Index type:           UInt32
Platform:             macOS-14.3-arm64-arm-64bit
Python:               3.10.11 | packaged by conda-forge | (main, May 10 2023, 19:01:19) [Clang 14.0.6 ]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          2.2.1
connectorx:           <not installed>
deltalake:            0.10.0
fsspec:               2023.12.2
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.8.2
numpy:                1.24.4
openpyxl:             3.1.2
pandas:               2.1.4
pyarrow:              15.0.0
pydantic:             1.10.13
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           1.4.48
xlsx2csv:             <not installed>
xlsxwriter:           3.1.0
@kmcentush kmcentush added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Feb 10, 2024
@cmdlineluser
Copy link
Contributor

It seems to be hitting this:

if not all(isinstance(c, str) for c in self.by):
msg = "cannot call `map_groups` when grouping by an expression"
raise TypeError(msg)

df.group_by(["color", "shape"]).by
# (['color', 'shape'],)

If you drop the list, it works:

df.group_by("color", "shape").by
# ('color', 'shape')

So that check probably needs some adjusting.

@kmcentush
Copy link
Author

Yup, that did it. I'll close this to not pollute the issue tracker - the current documentation does show it should be a variable number of arguments rather than a list.

That deprecation may have slipped through (or I missed it). Thanks again!

@cmdlineluser
Copy link
Contributor

As far as I can tell, there was no deprecation and it should still work.

The group_by docs still contain:

Group by multiple columns by passing a list of column names.

df.group_by(["a", "b"]).agg(pl.max("c"))

So it does appear to be an accidental behaviour change.

Just pinging @deanm0000 / @stinodego as I think they were working on that area.

Perhaps they can confirm if this is something that needs to be fixed or not.

@deanm0000
Copy link
Collaborator

deanm0000 commented Feb 10, 2024

Well this #14099 is still waiting which would more than fix this.

The problem is that when adding **kwargs to group_by we took out by and so there's only *by. I didn't add a check to see if by[0] is a list that needs to be unpacked before checking that all the inputs are strings.

@stinodego
Copy link
Member

So it does appear to be an accidental behaviour change.

It was - the parsing logic is no longer correct.

@stinodego stinodego reopened this Feb 10, 2024
@stinodego stinodego changed the title map_groups no longer works when grouping by multiple expressions map_groups no longer works when passing a list of column names to group_by Feb 10, 2024
@stinodego stinodego added P-medium Priority: medium and removed needs triage Awaiting prioritization by a maintainer labels Feb 10, 2024
@github-project-automation github-project-automation bot moved this to Ready in Backlog Feb 10, 2024
@stinodego stinodego added the regression Issue introduced by a new release label Feb 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working P-medium Priority: medium python Related to Python Polars regression Issue introduced by a new release
Projects
Status: Ready
Development

No branches or pull requests

4 participants