"Or" does not have commutative property (possibly because it catches exceptions silently?) #13806

josepablog · 2016-07-26T18:37:57Z

Code Sample, a copy-pastable example if possible

import pandas as pd

#Create sample dataframe
test = pd.DataFrame({"a":["12", float("nan"), "3", "meow" ]})

# Select rows that are null or numeric:
print len(test[ test["a"].isnull() | test["a"].str.isnumeric()])

# Select rows that are numeric or null
print len(test[ test["a"].str.isnumeric() | test["a"].isnull() ])

# These two should return the same result, right? Unfortunately they don't

Notice that Series.str.isnumeric throws an exception on null entries. However, I believe that this may introduce bugs silently

Expected Output

Expected output is:
2
2

Instead, I get:
3
2

output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.18.1

The text was updated successfully, but these errors were encountered:

TomAugspurger · 2016-07-26T21:00:02Z

FWIW, we follow python's logic here

In [40]: float('nan') or 1
Out[40]: nan

In [41]: 1 or float('nan')
Out[41]: 1

Your example is similar, the NaN | True vs True | NaN is causing the difference.

Thanks to short-circuiting, I don't think you can rely on | and or to be symmetrical.

In [44]: True or this_variable_is_undefined
Out[44]: True

The reverse of that throws a NameError.

However, I believe that this may introduce bugs silently

I suspect you're right 😄 I don't think we can change this behavior though.

TomAugspurger · 2016-07-26T21:06:49Z

I should also say that you can get the desired output by test.a.str.isnumeric().fillna(False) | test.a.isnull()

josepablog · 2016-07-26T21:29:58Z

I think I didn't phrase my issue correctly. We agree that short-circuiting may break commutativity.

I'm not talking about NaN | True. I'm talking about:
code that raises an exception | True

If you run isnumeric by itself, on my sample code, an exception is raised for the null values. The exception is NEVER raised when it's inside an or!

TomAugspurger · 2016-07-26T21:52:31Z

Ah, you're saying test['a'].str.isnumeric() raises an exception? I don't get one running master.

josepablog · 2016-07-26T22:05:46Z

@TomAugspurger thank you for getting back to me. This raises an exception to me:

test[ test["a"].str.isnumeric() ]

Which is the simplified version we were testing. The code you wrote, doesn't raise an exception for me either.

sinhrks · 2016-07-26T22:14:56Z

Yeah it raises ValueError because Series contains NaN.

# on current master
test["a"].str.isnumeric() 
# 0     True
# 1      NaN
# 2     True
# 3    False
# Name: a, dtype: object

I think .str.is_xxx method should return False rather than NaN if input is not string likes.

# expected
test["a"].str.isnumeric() 
# 0     True
# 1    False
# 2     True
# 3    False
# Name: a, dtype: object

josepablog · 2016-07-26T22:20:51Z

Possibly. I'm more worried about inconsistent behavior inside an or. Look:

test[ test["a"].str.isnumeric() ] Raises exception, which may be appropriate behavior
test[ test["a"].isnull() | test["a"].str.isnumeric()] Does not raise exception due to short-circuiting, therefore this is appropriate behavior
test[ test["a"].str.isnumeric() | test["a"].isnull() ]Does not raise an exception, although it should!

jorisvandenbossche · 2016-07-27T11:23:37Z

This boils down to the Series behaviour for | with Nans:

In [10]: pd.Series([True, np.nan]) | pd.Series([False, True])
Out[10]:
0     True
1    False
dtype: bool

In [11]: pd.Series([False, True]) | pd.Series([True, np.nan])
Out[11]:
0    True
1    True
dtype: bool

So the comparison never propagates the NaN. For this reason, you do not get any exception.
But, the difference between both is a bit surprising.

jorisvandenbossche · 2016-07-27T11:25:41Z

The issue about NaNs in booleans is already discussed in #6528, so closing here as a duplicate.

jorisvandenbossche · 2016-07-27T11:29:40Z

test[ test["a"].str.isnumeric() | test["a"].isnull() ]Does not raise an exception, although it should!

@josepablog Whether it should raise or not depends on the outcome of np.nan | True/False. This currently returns False. If we decide that False is the correct outcome, then the above code snippet should not raise. However, it could be argued that np.nan | True/False should return np.nan, and in that case the above code snippet will raise. But the return value of np.nan | True/False is rather independent of this specific use case, and as in my previous comment, this is covered in #6528

sinhrks · 2016-07-27T12:06:55Z

@jorisvandenbossche, @TomAugspurger what do you think .str accessor to return error case, NaN or False?

jorisvandenbossche · 2016-07-27T12:18:51Z

Ah yes, forgot that part of the issue. Is there any precendence for other of the string methods on how they handle errors / NaNs?

TomAugspurger · 2016-07-27T12:28:06Z

@sinhrks I prefer the current behavior of str.is_* propagating NAs.

It seems more consistent with other non-reducing behavior for how it treats NAs (eg. nan + 1 is nan)
If you seems easier to followup the is_* with a fillna than to somehow recover the NAs if you do want to propagate them.

We could document this better. This is a common enough gotcha that we should include a warning box in the prose docs. Another option is to add a fill_value to all the is* methods for NA behavior, defaulting to None/np.nan. That would give the behavior a bit more visibility.

TomAugspurger added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Usage Question labels Jul 26, 2016

sinhrks added the Strings String extension data type and string data label Jul 26, 2016

jorisvandenbossche closed this as completed Jul 27, 2016

jorisvandenbossche added this to the No action milestone Jul 27, 2016

jorisvandenbossche added the Duplicate Report Duplicate issue or pull request label Jul 27, 2016

This was referenced Aug 2, 2016

ERR/API: non-string types #13877

Closed

API: Series/DataFrame logical op NaN propagation #13896

Closed

h-vetinari mentioned this issue Sep 17, 2018

Mismatch in Series/DataFrame boolean ops behavior #22724

Closed

h-vetinari mentioned this issue Jan 28, 2019

RLS: 0.25.0 #24950

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

"Or" does not have commutative property (possibly because it catches exceptions silently?) #13806

"Or" does not have commutative property (possibly because it catches exceptions silently?) #13806

josepablog commented Jul 26, 2016 •

edited

Loading

TomAugspurger commented Jul 26, 2016 •

edited

Loading

TomAugspurger commented Jul 26, 2016 •

edited

Loading

josepablog commented Jul 26, 2016

TomAugspurger commented Jul 26, 2016

josepablog commented Jul 26, 2016

sinhrks commented Jul 26, 2016 •

edited

Loading

josepablog commented Jul 26, 2016

jorisvandenbossche commented Jul 27, 2016

jorisvandenbossche commented Jul 27, 2016

jorisvandenbossche commented Jul 27, 2016

sinhrks commented Jul 27, 2016

jorisvandenbossche commented Jul 27, 2016

TomAugspurger commented Jul 27, 2016

"Or" does not have commutative property (possibly because it catches exceptions silently?) #13806

"Or" does not have commutative property (possibly because it catches exceptions silently?) #13806

Comments

josepablog commented Jul 26, 2016 • edited Loading

Code Sample, a copy-pastable example if possible

Expected Output

output of pd.show_versions()

INSTALLED VERSIONS

TomAugspurger commented Jul 26, 2016 • edited Loading

TomAugspurger commented Jul 26, 2016 • edited Loading

josepablog commented Jul 26, 2016

TomAugspurger commented Jul 26, 2016

josepablog commented Jul 26, 2016

sinhrks commented Jul 26, 2016 • edited Loading

josepablog commented Jul 26, 2016

jorisvandenbossche commented Jul 27, 2016

jorisvandenbossche commented Jul 27, 2016

jorisvandenbossche commented Jul 27, 2016

sinhrks commented Jul 27, 2016

jorisvandenbossche commented Jul 27, 2016

TomAugspurger commented Jul 27, 2016

josepablog commented Jul 26, 2016 •

edited

Loading

output of `pd.show_versions()`

TomAugspurger commented Jul 26, 2016 •

edited

Loading

TomAugspurger commented Jul 26, 2016 •

edited

Loading

sinhrks commented Jul 26, 2016 •

edited

Loading