Exception when using loc[boolean] in assignment #1044

Closed
gshimansky opened this issue Feb 3, 2020 · 14 comments
Labels
bug 🦗 Something isn't working pandas 🤔 Weird Behaviors of Pandas
Milestone

Comments

@gshimansky
Collaborator

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
    Ubuntu 19.10
  • Modin installed from (source or binary):
    Source
  • Modin version:
    Git master revision 0.7.0+3.g7528bf3
  • Python version:
    Python 3.7.5
  • Exact command to reproduce:

Describe the problem

The following reproducer works with pandas but raises an exception on the assignment when run with Modin.

Source code / logs

#import pandas as pd
import ray
ray.init(huge_pages=True, plasma_directory="/mnt/hugepages")
import modin.pandas as pd

df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
                  index=['cobra', 'viper', 'sidewinder'],
                  columns=['max_speed', 'shield'])
print(df)

condition = df['shield'] > 6
print(condition)

df.loc[condition, 'new_col'] = df.loc[condition, 'max_speed']
print(df)
@devin-petersohn
Collaborator

Thanks @gshimansky for the report!

I can reproduce this locally and will get this fixed for the next release. Thanks again for reporting!

Here's the Traceback for future reference:

AttributeError                            Traceback (most recent call last)
<ipython-input-1-2f46b9138d43> in <module>
      9 print(condition)
     10 
---> 11 df.loc[condition, 'new_col'] = df.loc[condition, 'max_speed']
     12 print(df)

~/software_builds/modin/modin/pandas/indexing.py in __setitem__(self, key, item)
    266         ):
    267             new_col = pandas.Series(index=self.df.index)
--> 268             new_col[row_loc] = item
    269             self.df.insert(loc=len(self.df.columns), column=col_loc[0], value=new_col)
    270             self.qc = self.df._query_compiler

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/series.py in __setitem__(self, key, value)
   1242         # do the setitem
   1243         cacher_needs_updating = self._check_is_chained_assignment_possible()
-> 1244         setitem(key, value)
   1245         if cacher_needs_updating:
   1246             self._maybe_update_cacher()

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/series.py in setitem(key, value)
   1238                     pass
   1239 
-> 1240             self._set_with(key, value)
   1241 
   1242         # do the setitem

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/series.py in _set_with(self, key, value)
   1297                     return self._set_values(key, value)
   1298             elif key_type == "boolean":
-> 1299                 self._set_values(key.astype(np.bool_), value)
   1300             else:
   1301                 self._set_labels(key, value)

AttributeError: 'list' object has no attribute 'astype'

@devin-petersohn devin-petersohn added this to the 0.7.1 milestone Feb 3, 2020
@devin-petersohn devin-petersohn added bug 🦗 Something isn't working pandas 🤔 Weird Behaviors of Pandas labels Feb 3, 2020
@gshimansky
Collaborator Author

Hi. I wrote another reproducer that produces a quite different exception stack trace (I think because it uses a different indexing engine in pandas). The only difference is that the index uses integers, but the error message is quite different.

#import pandas as pd
import ray
ray.init(huge_pages=True, plasma_directory="/mnt/hugepages")
import modin.pandas as pd

df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
                  index=[1, 2, 3],
                  columns=['max_speed', 'shield'])
print(df)

condition = df['shield'] > 6
print(condition)

df.loc[condition, 'new_col'] = df.loc[condition, 'max_speed']
print(df)

@devin-petersohn
Collaborator

Interesting @gshimansky, would you post the Traceback you get? When I used your code above it gave me the trace in my previous comment.

@gshimansky
Collaborator Author

Ok

$ python3 loc_test2.py
2020-02-04 11:19:16,846 INFO resource_spec.py:216 -- Starting Ray with 94.63 GiB memory available for workers and up to 18.63 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-02-04 11:19:17,098 WARNING services.py:1365 -- WARNING: object_store_memory is not verified when plasma_directory is set.
UserWarning: Distributing <class 'list'> object. This may take some time.
   max_speed  shield
1          1       2
2          4       5
3          7       8
1    False
2    False
3     True
Name: shield, dtype: bool
Traceback (most recent call last):
  File "/nfs/site/home/gashiman/.local/lib/python3.7/site-packages/pandas/core/series.py", line 1255, in _set_with_engine
    self.index._engine.set_value(values, key, value)
  File "pandas/_libs/index.pyx", line 94, in pandas._libs.index.IndexEngine.set_value
  File "pandas/_libs/index.pyx", line 102, in pandas._libs.index.IndexEngine.set_value
  File "pandas/_libs/index.pyx", line 128, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index_class_helper.pxi", line 91, in pandas._libs.index.Int64Engine._check_type
KeyError: 1    False
2    False
3     True
Name: shield, dtype: bool

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/nfs/site/home/gashiman/.local/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2897, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 128, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index_class_helper.pxi", line 91, in pandas._libs.index.Int64Engine._check_type
KeyError: 1    False
2    False
3     True
Name: shield, dtype: bool

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/nfs/site/home/gashiman/.local/lib/python3.7/site-packages/pandas/core/series.py", line 1193, in setitem
    self._set_with_engine(key, value)
  File "/nfs/site/home/gashiman/.local/lib/python3.7/site-packages/pandas/core/series.py", line 1258, in _set_with_engine
    values[self.index.get_loc(key)] = value
  File "/nfs/site/home/gashiman/.local/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2899, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 128, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index_class_helper.pxi", line 91, in pandas._libs.index.Int64Engine._check_type
KeyError: 1    False
2    False
3     True
Name: shield, dtype: bool

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "loc_test2.py", line 14, in <module>
    df.loc[condition, 'new_col'] = df.loc[condition, 'max_speed']
  File "/nfs/site/proj/scripting_tools/gashiman/modin/modin/pandas/indexing.py", line 268, in __setitem__
    new_col[row_loc] = item
  File "/nfs/site/home/gashiman/.local/lib/python3.7/site-packages/pandas/core/series.py", line 1244, in __setitem__
    setitem(key, value)
  File "/nfs/site/home/gashiman/.local/lib/python3.7/site-packages/pandas/core/series.py", line 1221, in setitem
    self.loc[key] = value
  File "/nfs/site/home/gashiman/.local/lib/python3.7/site-packages/pandas/core/indexing.py", line 204, in __setitem__
    indexer = self._get_setitem_indexer(key)
  File "/nfs/site/home/gashiman/.local/lib/python3.7/site-packages/pandas/core/indexing.py", line 191, in _get_setitem_indexer
    return self._convert_to_indexer(key, axis=axis, is_setter=True)
  File "/nfs/site/home/gashiman/.local/lib/python3.7/site-packages/pandas/core/indexing.py", line 1285, in _convert_to_indexer
    return self._get_listlike_indexer(obj, axis, **kwargs)[1]
  File "/nfs/site/home/gashiman/.local/lib/python3.7/site-packages/pandas/core/indexing.py", line 1092, in _get_listlike_indexer
    keyarr, indexer, o._get_axis_number(axis), raise_missing=raise_missing
  File "/nfs/site/home/gashiman/.local/lib/python3.7/site-packages/pandas/core/indexing.py", line 1177, in _validate_read_indexer
    key=key, axis=self.obj._get_axis_name(axis)
KeyError: "None of [Index([False, False, True], dtype='object')] are in the [index]"

@gshimansky
Collaborator Author

The reason I added a second reproducer is that it produces exactly the error I am getting. When I wrote the first reproducer I saw that the errors were different, but I couldn't make it produce the exception I get in my code. Today I managed to find the difference: in the first reproducer pandas uses pandas._libs.index.ObjectEngine, while in the second it uses pandas._libs.index.Int64Engine. They behave differently and generate different exceptions.

@devin-petersohn
Collaborator

Thanks @gshimansky, there may be two bugs here. I'll dig into this and get back. The indexing logic is a bit complex and dense because pandas allows so many different ways of using loc. I should have time to get into this today or tomorrow, depending on the complexity of the fix. Thanks again, these cases are very helpful!

@devin-petersohn
Collaborator

I dug into this a bit and it's going to be a difficult edge case to solve. Basically, the challenge is creating a new column and assigning to it through a boolean mask at the same time.

One option is to insert the new column, then correct the values after the new column is generated. Another option is to reindex the new Series and then apply the NaN values to the Series before insertion. I will have to play with some things to figure out how we want to do this, but it is a challenging behavior to support generally, because we're setting a subset of the data to a subset of some other data and filling NaN values for the rest, which may not be possible in one pass.
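
For illustration, the two options could be sketched in plain pandas roughly as follows. This is only a sketch of the idea, not Modin's implementation; the names opt1, opt2, item and new_col are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
                  index=[1, 2, 3],
                  columns=['max_speed', 'shield'])
condition = df['shield'] > 6
item = df.loc[condition, 'max_speed']  # Series with a single row (index 3)

# Option 1 (sketch): insert an all-NaN column first, then fix up the masked rows.
opt1 = df.copy()
opt1['new_col'] = np.nan
opt1.loc[condition, 'new_col'] = item

# Option 2 (sketch): align the values to the full index, blank out the rows
# outside the mask, and only then insert the finished column.
opt2 = df.copy()
new_col = item.reindex(df.index).where(condition)
opt2.insert(loc=len(opt2.columns), column='new_col', value=new_col)

print(opt1)
print(opt2)
```

Both variants take two steps (build or align the column, then mask it), which matches the concern that this may not be possible in one pass.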

@gshimansky
Collaborator Author

Would this special case slow down all other loc paths? I wonder how easy it is to detect in Python that an argument is an array of booleans.
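
For reference, one way such a check can be written with public pandas APIs is sketched below. The helper name is_boolean_mask is hypothetical, and this is not the check Modin itself uses.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_bool_dtype


def is_boolean_mask(key):
    """Hypothetical helper: True if `key` looks like a boolean row mask."""
    if isinstance(key, (pd.Series, np.ndarray)):
        # Only the dtype is inspected here; the values are not scanned.
        return bool(is_bool_dtype(key))
    if isinstance(key, list):
        # Plain lists carry no dtype, so each element has to be checked.
        return len(key) > 0 and all(isinstance(k, (bool, np.bool_)) for k in key)
    return False


df = pd.DataFrame({'shield': [2, 5, 8]})
print(is_boolean_mask(df['shield'] > 6))      # True
print(is_boolean_mask([False, False, True]))  # True
print(is_boolean_mask('new_col'))             # False
```

For a Series or NumPy array the check is a dtype lookup, so it should be cheap; only plain Python lists need a pass over the elements.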

@devin-petersohn
Collaborator

Detecting the booleans isn't the issue in this case (we already do that). The issue is creating a new column through a boolean index while assigning to it another Series that is itself a boolean-indexed set of values. Here's an example from above:

df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
                  index=[1, 2, 3],
                  columns=['max_speed', 'shield'])
print(df)

condition = df['shield'] > 6
new_values = df.loc[condition, 'max_speed'] # A Series with one value
df.loc[condition, 'new_col'] = new_values # Create a new column with the subset of values

I tested some other weird cases, such as when the boolean index doesn't line up with the new values (by changing the condition in the example). In that case the new column is all NaN values. Solving it generally will take some thought, I think. I don't think it would slow down other loc paths, though.
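
The misaligned case can be reproduced in plain pandas with a sketch like the one below. Here other is a made-up name for a second condition that selects a different row than condition, and the exact behavior may vary between pandas versions.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
                  index=[1, 2, 3],
                  columns=['max_speed', 'shield'])

condition = df['shield'] > 6             # selects the row with index 3
other = df['shield'] < 3                 # selects the row with index 1
new_values = df.loc[other, 'max_speed']  # Series indexed by [1]

# The right-hand side is aligned on index labels. Row 3 has no matching
# label in new_values, so nothing is filled in for the new column.
df.loc[condition, 'new_col'] = new_values
print(df)
```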

@gshimansky
Collaborator Author

Yes, I didn't realize that the condition array may be of arbitrary length, not necessarily aligned with the dataframe length.

@devin-petersohn devin-petersohn modified the milestones: 0.7.1, 0.7.2 Feb 19, 2020
@devin-petersohn devin-petersohn modified the milestones: 0.7.3, 0.7.4 Apr 26, 2020
@devin-petersohn
Collaborator

Pushing to next release

@devin-petersohn devin-petersohn modified the milestones: 0.8.0, 0.8.1 Jul 24, 2020
@anmyachev anmyachev modified the milestones: 0.8.1, Someday Oct 14, 2020
@kunal-gohrani

kunal-gohrani commented Oct 15, 2020

Hi, since this issue has been open for quite some time, I wanted to know if there's a workaround for this.

My code is doing something like:
data[condition, 'NEW COLUMN'] = 'BUY'

and this produces an error.

@devin-petersohn
Collaborator

@kunal-gohrani Thanks for posting!

Does data.loc[condition, 'NEW COLUMN'] = 'BUY' work? If it does, I think your error may be unrelated to this issue. This issue is specific to cases where the left and right objects do not align perfectly, but single-value assignment shouldn't give any error.
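
As a point of comparison, this is how the single-value form behaves in plain pandas. The price column and the data frame contents are invented for the example, and this sketch has not been checked against Modin.

```python
import pandas as pd

# Invented sample data for illustration only.
data = pd.DataFrame({'price': [10, 20, 30]})
condition = data['price'] > 15

# Scalar assignment through .loc creates the column and fills only the
# masked rows; rows outside the mask are left as missing values.
data.loc[condition, 'NEW COLUMN'] = 'BUY'
print(data)
```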

@mvashishtha
Collaborator

The error from the original reproducer is now KeyError: array(['new_col'], dtype='<U7') at modin c9fc326. Sounds like #3764.

I'll close this issue as stale.
