
BUG: Pandas 1.0.5 → 1.1.0 behavior change on DataFrame.apply() where func returns tuple #35518

Open
dechamps opened this issue Aug 2, 2020 · 10 comments
Labels
Apply (Apply, Aggregate, Transform, Map) · Bug · Nested Data (data where the values are collections: lists, sets, dicts, objects, etc.) · Regression (functionality that used to work in a prior pandas version)

Comments

@dechamps

dechamps commented Aug 2, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd

print(
    pd.DataFrame([['orig1', 'orig2']])
    .apply(func=lambda col: ('new1', 'new2')))

Output of Pandas 1.0.5

0    (new1, new2)
1    (new1, new2)
dtype: object

Output of Pandas 1.1.0

      0     1
0  new1  new1
1  new2  new2

It is not clear to me whether this behaviour change is intended. I couldn't find anything obvious in the release notes.

Possibly related: #35517, #34909 @simonjayhawkins @jbrockmendel

This broke my code, which is actively relying on tuples being treated as scalars and stored as single objects (rather than being spread across the DataFrame).
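
For reference, the old behaviour can still be requested explicitly through the result_type parameter; a minimal sketch (the result_type="reduce" option is demonstrated later in this thread):

import pandas as pd

df = pd.DataFrame([['orig1', 'orig2']])

# Asking for the reduce behaviour explicitly keeps the tuple as a single
# object per column instead of expanding it across the DataFrame.
print(df.apply(func=lambda col: ('new1', 'new2'), result_type='reduce'))
# 0    (new1, new2)
# 1    (new1, new2)
# dtype: object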

Output of pd.show_versions()

INSTALLED VERSIONS

commit : d9fff27
python : 3.6.9.final.0
python-bits : 64
OS : Linux
OS-release : 4.19.104+
Version : #1 SMP Wed Feb 19 05:26:34 PST 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.0
numpy : 1.19.1
pytz : 2018.9
dateutil : 2.8.1
pip : 19.3.1
setuptools : 49.2.0
Cython : 0.29.21
pytest : 3.6.4
hypothesis : None
sphinx : 1.8.5
blosc : None
feather : 0.4.1
xlsxwriter : None
lxml.etree : 4.2.6
html5lib : 1.0.1
pymysql : None
psycopg2 : 2.7.6.1 (dt dec pq3 ext lo64)
jinja2 : 2.11.2
IPython : 5.5.0
pandas_datareader: None
bs4 : 4.6.3
bottleneck : 1.3.2
fsspec : 0.7.4
fastparquet : None
gcsfs : None
matplotlib : 3.2.2
numexpr : 2.7.1
odfpy : None
openpyxl : 2.5.9
pandas_gbq : 0.11.0
pyarrow : 0.14.1
pytables : None
pyxlsb : None
s3fs : 0.4.2
scipy : 1.4.1
sqlalchemy : 1.3.18
tables : 3.4.4
tabulate : 0.8.7
xarray : 0.15.1
xlrd : 1.1.0
xlwt : 1.3.0
numba : 0.48.0

@dechamps added the Bug and Needs Triage labels Aug 2, 2020
dechamps added a commit to dechamps/LoudspeakerExplorer that referenced this issue Aug 2, 2020
The "tuple trick" to force Pandas to treat the return value of an apply
func as a scalar stopped working between Pandas 1.0.5 and 1.1.0:
  pandas-dev/pandas#35518
@simonjayhawkins added the Apply and Regression labels and removed the Needs Triage label Aug 3, 2020
@simonjayhawkins added this to the 1.1.1 milestone Aug 3, 2020
@jbrockmendel
Member

which is actively relying on tuples being treated as scalars and stored as single objects

If you have a viable way to avoid this in your code, I'd encourage you to use it. Regardless of how this issue is addressed, tuples-as-scalars is fragile.

@dechamps
Author

dechamps commented Aug 4, 2020

If you have a viable way to avoid this in your code, I'd encourage you to use it. Regardless of how this issue is addressed, tuples-as-scalars is fragile

Yep. Well at least this issue forced me to clean up my code :) I'm now wrapping the value inside a fully opaque container object.
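
A minimal sketch of that wrapper approach; the Box class and its value attribute are illustrative names, not code from the repository referenced above:

import pandas as pd

class Box:
    """Opaque container, so pandas never treats the payload as list-like."""
    def __init__(self, value):
        self.value = value
    def __repr__(self):
        return f"Box({self.value!r})"

df = pd.DataFrame([["orig1", "orig2"]])

# Each cell of the result holds a single Box object, regardless of pandas version.
result = df.apply(func=lambda col: Box(("new1", "new2")))
print(result)

# The wrapped tuple is retrieved explicitly, e.g. result[0].value == ("new1", "new2")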

@simonjayhawkins
Member

moved off the 1.1.2 milestone (scheduled for this week) as there are no PRs to fix this in the pipeline

@simonjayhawkins modified the milestones: 1.1.3, 1.1.4 Oct 5, 2020
@simonjayhawkins
Member

moved off the 1.1.3 milestone (overdue) as there are no PRs to fix this in the pipeline

@simonjayhawkins modified the milestones: 1.1.4, 1.1.5 Oct 29, 2020
@simonjayhawkins
Member

moved off the 1.1.4 milestone (scheduled for release tomorrow) as there are no PRs to fix this in the pipeline

@jorisvandenbossche
Member

According to the docstring, I would say that the behaviour of 1.0.5 was correct, and this is a regression.

@jbrockmendel would you have time to look into it?

@jreback modified the milestones: 1.1.5, Contributions Welcome Nov 25, 2020
@simonjayhawkins added the Nested Data label Dec 20, 2020
@simonjayhawkins
Member

According to the docstring

Just to be clear: in the DataFrame.apply docstring (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html), the description of the result_type parameter is...

The default behaviour (None) depends on the return value of the applied function: list-like results will be returned as a Series of those. However if the apply function returns a Series these are expanded to columns.

The return type of the user function in the OP is a tuple (considered list-like), so we expect a Series of those.

This issue also occurs with a list (which is unambiguously list-like), where we also expect the default result_type behaviour to be to reduce.

>>> pd.__version__
'1.3.0.dev0+100.g54682234e3'
>>>
>>> df = pd.DataFrame([["orig1", "orig2"]])
>>>
>>> df.apply(func=lambda col: ("new1", "new2"), result_type="reduce")
0    (new1, new2)
1    (new1, new2)
dtype: object
>>>
>>> df.apply(func=lambda col: ("new1", "new2"))
      0     1
0  new1  new1
1  new2  new2
>>>
>>> df.apply(func=lambda col: ["new1", "new2"], result_type="reduce")
0    [new1, new2]
1    [new1, new2]
dtype: object
>>>
>>> df.apply(func=lambda col: ["new1", "new2"])
      0     1
0  new1  new1
1  new2  new2
>>>

I would say that the behaviour of 1.0.5 was correct, and this is a regression.

agreed.

@jbrockmendel would you have time to look into it?

ping

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Dec 20, 2020
@simonjayhawkins
Member

Possibly related: #35517, #34909 @simonjayhawkins @jbrockmendel

can confirm, first bad commit: [91802a9] PERF: avoid creating many Series in apply_standard (#34909)

@jbrockmendel
Member

Aside from reverting #34909, the solution that comes to mind is calling the function on the first row in wrap_results_for_axis and seeing if we get a tuple. That runs into other problems with non-univalent or mutating functions.
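
A standalone sketch of that probe-the-first-result idea and why it is fragile; probe_then_apply is a hypothetical helper for illustration, not the actual logic in wrap_results_for_axis:

import pandas as pd

def probe_then_apply(df, func):
    # Probe: call func on the first column (default axis=0) to decide how to
    # wrap the results. This calls func an extra time on that column, which is
    # exactly where mutating or non-deterministic functions cause trouble.
    probe = func(df.iloc[:, 0])
    if isinstance(probe, tuple):
        # Treat tuples as scalars: one object per column (the 1.0.5 behaviour).
        return df.apply(func, result_type="reduce")
    # Otherwise keep the default behaviour.
    return df.apply(func)

df = pd.DataFrame([["orig1", "orig2"]])
print(probe_then_apply(df, lambda col: ("new1", "new2")))
# 0    (new1, new2)
# 1    (new1, new2)
# dtype: object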

@simonjayhawkins
Member

removing milestone

@simonjayhawkins modified the milestones: 1.3, Contributions Welcome Jun 11, 2021
@mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022