Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"Or" does not have commutative property (possibly because it catches exceptions silently?) #13806

Closed
josepablog opened this issue Jul 26, 2016 · 13 comments
Labels
Duplicate Report Duplicate issue or pull request Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Strings String extension data type and string data Usage Question

Comments

@josepablog
Copy link

josepablog commented Jul 26, 2016

Code Sample, a copy-pastable example if possible

import pandas as pd

#Create sample dataframe
test = pd.DataFrame({"a":["12", float("nan"), "3", "meow" ]})

# Select rows that are null or numeric:
print len(test[ test["a"].isnull() | test["a"].str.isnumeric()])

# Select rows that are numeric or null
print len(test[ test["a"].str.isnumeric() | test["a"].isnull() ])

# These two should return the same result, right? Unfortunately they don't

Notice that Series.str.isnumeric throws an exception on null entries. However, I believe that this may introduce bugs silently

Expected Output

Expected output is:
2
2

Instead, I get:
3
2

output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.18.1

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jul 26, 2016

FWIW, we follow python's logic here

In [40]: float('nan') or 1
Out[40]: nan

In [41]: 1 or float('nan')
Out[41]: 1

Your example is similar, the NaN | True vs True | NaN is causing the difference.

Thanks to short-circuiting, I don't think you can rely on | and or to be symmetrical.

In [44]: True or this_variable_is_undefined
Out[44]: True

The reverse of that throws a NameError.

However, I believe that this may introduce bugs silently

I suspect you're right 😄 I don't think we can change this behavior though.

@TomAugspurger TomAugspurger added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Usage Question labels Jul 26, 2016
@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jul 26, 2016

I should also say that you can get the desired output by test.a.str.isnumeric().fillna(False) | test.a.isnull()

@josepablog
Copy link
Author

I think I didn't phrase my issue correctly. We agree that short-circuiting may break commutativity.

I'm not talking about NaN | True. I'm talking about:
code that raises an exception | True

If you run isnumeric by itself, on my sample code, an exception is raised for the null values. The exception is NEVER raised when it's inside an or!

@TomAugspurger
Copy link
Contributor

Ah, you're saying test['a'].str.isnumeric() raises an exception? I don't get one running master.

@josepablog
Copy link
Author

@TomAugspurger thank you for getting back to me. This raises an exception to me:

test[ test["a"].str.isnumeric() ]

Which is the simplified version we were testing. The code you wrote, doesn't raise an exception for me either.

@sinhrks
Copy link
Member

sinhrks commented Jul 26, 2016

Yeah it raises ValueError because Series contains NaN.

# on current master
test["a"].str.isnumeric() 
# 0     True
# 1      NaN
# 2     True
# 3    False
# Name: a, dtype: object

I think .str.is_xxx method should return False rather than NaN if input is not string likes.

# expected
test["a"].str.isnumeric() 
# 0     True
# 1    False
# 2     True
# 3    False
# Name: a, dtype: object

@sinhrks sinhrks added the Strings String extension data type and string data label Jul 26, 2016
@josepablog
Copy link
Author

Possibly. I'm more worried about inconsistent behavior inside an or. Look:

  • test[ test["a"].str.isnumeric() ] Raises exception, which may be appropriate behavior
  • test[ test["a"].isnull() | test["a"].str.isnumeric()] Does not raise exception due to short-circuiting, therefore this is appropriate behavior
  • test[ test["a"].str.isnumeric() | test["a"].isnull() ]Does not raise an exception, although it should!

@jorisvandenbossche
Copy link
Member

This boils down to the Series behaviour for | with Nans:

In [10]: pd.Series([True, np.nan]) | pd.Series([False, True])
Out[10]:
0     True
1    False
dtype: bool

In [11]: pd.Series([False, True]) | pd.Series([True, np.nan])
Out[11]:
0    True
1    True
dtype: bool

So the comparison never propagates the NaN. For this reason, you do not get any exception.
But, the difference between both is a bit surprising.

@jorisvandenbossche
Copy link
Member

The issue about NaNs in booleans is already discussed in #6528, so closing here as a duplicate.

@jorisvandenbossche jorisvandenbossche added this to the No action milestone Jul 27, 2016
@jorisvandenbossche jorisvandenbossche added the Duplicate Report Duplicate issue or pull request label Jul 27, 2016
@jorisvandenbossche
Copy link
Member

test[ test["a"].str.isnumeric() | test["a"].isnull() ]Does not raise an exception, although it should!

@josepablog Whether it should raise or not depends on the outcome of np.nan | True/False. This currently returns False. If we decide that False is the correct outcome, then the above code snippet should not raise. However, it could be argued that np.nan | True/False should return np.nan, and in that case the above code snippet will raise. But the return value of np.nan | True/False is rather independent of this specific use case, and as in my previous comment, this is covered in #6528

@sinhrks
Copy link
Member

sinhrks commented Jul 27, 2016

@jorisvandenbossche, @TomAugspurger what do you think .str accessor to return error case, NaN or False?

@jorisvandenbossche
Copy link
Member

Ah yes, forgot that part of the issue. Is there any precendence for other of the string methods on how they handle errors / NaNs?

@TomAugspurger
Copy link
Contributor

@sinhrks I prefer the current behavior of str.is_* propagating NAs.

  1. It seems more consistent with other non-reducing behavior for how it treats NAs (eg. nan + 1 is nan)
  2. If you seems easier to followup the is_* with a fillna than to somehow recover the NAs if you do want to propagate them.

We could document this better. This is a common enough gotcha that we should include a warning box in the prose docs. Another option is to add a fill_value to all the is* methods for NA behavior, defaulting to None/np.nan. That would give the behavior a bit more visibility.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Duplicate Report Duplicate issue or pull request Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Strings String extension data type and string data Usage Question
Projects
None yet
Development

No branches or pull requests

4 participants