.loc can't work effectively when I try to append a list to dataframe #2953

Closed
ifrozenwhale opened this issue Apr 1, 2021 · 5 comments
@ifrozenwhale

System information

  • OS: Linux Ubuntu 20.04
  • Modin version: 0.9.1
  • Python version: 3.8
  • Code we can use to reproduce:
import modin.pandas as pd


def list2DataFrame(columns, values):
    test_df = pd.DataFrame(columns=columns)
    
    test_df.loc[len(test_df)] = values
    return test_df

if __name__ == '__main__':
    columns = ['name', 'id']
    values = ['frozenwhale','2001']
    df = list2DataFrame(columns, values)
    print(df)

Describe the problem

When I try to append a list as a new row to an empty DataFrame with .loc, Modin raises an IndexError:

Traceback (most recent call last):
  File "test_bug.py", line 14, in <module>
    df = list2DataFrame(columns, values)
  File "test_bug.py", line 8, in list2DataFrame
    test_df.loc[len(test_df)] = values
  File "/usr/local/lib/python3.8/dist-packages/modin/pandas/indexing.py", line 594, in __setitem__
    self.qc = self.qc.reindex(labels=index, axis=0)
  File "/usr/local/lib/python3.8/dist-packages/modin/backends/pandas/query_compiler.py", line 521, in reindex
    new_modin_frame = self._modin_frame._apply_full_axis(
  File "/usr/local/lib/python3.8/dist-packages/modin/engines/base/frame/data.py", line 1302, in _apply_full_axis
    return self.broadcast_apply_full_axis(
  File "/usr/local/lib/python3.8/dist-packages/modin/engines/base/frame/data.py", line 1706, in broadcast_apply_full_axis
    result = self.__constructor__(
  File "/usr/local/lib/python3.8/dist-packages/modin/engines/base/frame/data.py", line 82, in __init__
    self._filter_empties()
  File "/usr/local/lib/python3.8/dist-packages/modin/engines/base/frame/data.py", line 282, in _filter_empties
    self._column_widths_cache = [w for w in self._column_widths if w != 0]
  File "/usr/local/lib/python3.8/dist-packages/modin/engines/dask/pandas_on_dask/frame/data.py", line 50, in _column_widths
    for obj in self._partitions[0]
IndexError: index 0 is out of bounds for axis 0 with size 0

Source code / logs

The same code works with pandas.
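
A possible way to sidestep enlarging an empty frame is to collect the rows first and build the DataFrame in one step. This is only a sketch: it is written against plain pandas, the helper name rows_to_dataframe is hypothetical, and whether the same construction avoids the error on Modin 0.9.1 was not verified in this thread.

```python
import pandas as pd  # assumption: swap in `import modin.pandas as pd` to try this with Modin


def rows_to_dataframe(columns, rows):
    """Build the frame once from a list of rows instead of growing an empty frame with .loc."""
    return pd.DataFrame(rows, columns=columns)


columns = ['name', 'id']
rows = [['frozenwhale', '2001']]  # each inner list is one row
df = rows_to_dataframe(columns, rows)
print(df)
```

Because the frame is never empty at the point where data is written, the empty-partition code path shown in the traceback is not exercised by this construction.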

@ifrozenwhale ifrozenwhale added the bug 🦗 Something isn't working label Apr 1, 2021
@devin-petersohn devin-petersohn added this to the bugs and regressions milestone Apr 5, 2021
@devin-petersohn
Collaborator

Hi @ifrozenwhale, thanks for the report!

This looks like an issue with empty dataframes: I cannot reproduce it when the dataframe has content, only when it is empty. We are actively working on improving the handling of empty dataframes. Thanks again for the report, we will get this fixed!
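
For reference, the behavior the report expects can be sketched in plain pandas, where .loc with a new label enlarges both an empty and a non-empty frame. The helper name append_row is hypothetical, and the snippet uses pandas rather than Modin so that it shows the target behavior, not the bug.

```python
import pandas as pd


def append_row(df, values):
    # Assigning to a label that does not exist yet enlarges the frame by one row.
    df.loc[len(df)] = values
    return df


# Empty frame with only column labels: the case that fails in the report.
empty = append_row(pd.DataFrame(columns=['name', 'id']), ['frozenwhale', '2001'])

# Frame that already holds one row: the case that could not be made to fail.
seeded = append_row(pd.DataFrame({'name': ['a'], 'id': ['1']}), ['frozenwhale', '2001'])

print(len(empty), len(seeded))
```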

@bstivers

bstivers commented Jan 9, 2022

I think I have encountered this same (or similar) issue, but with IndexError: index -1 is out of bounds for axis 0 with size 0. Also starting from an empty DataFrame.

I'm unsure whether I should open a new issue. The behavior is the same: it works with pandas but not with modin[ray].

Disclaimer: Have not tried to run this on a duplicated DataFrame (yet).

EDIT: df.copy() does work.

Environment

[tool.poetry.dependencies]
python = "3.9.9"
modin = { extras = ["ray"], version = "0.12.1" }
notebook = "6.4.6"

Full context

Tutorial: Using Pandas with Large Data Sets in Python

Issue

converted_obj = pd.DataFrame()   # DOES NOT WORK
# converted_obj = df_obj.copy()   # WORKS

for col in df_obj.columns:
    num_unique_values = len(df_obj[col].unique())
    num_total_values = len(df_obj[col])
    if num_unique_values / num_total_values < 0.5:
        converted_obj.loc[:,col] = df_obj[col].astype('category')
    else:
        converted_obj.loc[:,col] = df_obj[col]

IndexError                                Traceback (most recent call last)
/tmp/ipykernel_689608/3414656741.py in <module>
      5     num_total_values = len(df_obj[col])
      6     if num_unique_values / num_total_values < 0.5:
----> 7         converted_obj.loc[:,col] = df_obj[col].astype('category')
      8     else:
      9         converted_obj.loc[:,col] = df_obj[col]

~/Devel/redqueen/projects/icicles/.venv/lib/python3.9/site-packages/modin/pandas/indexing.py in __setitem__(self, key, item)
    659         else:
    660             row_lookup, col_lookup = self._compute_lookup(row_loc, col_loc)
--> 661             super(_LocIndexer, self).__setitem__(
    662                 row_lookup,
    663                 col_lookup,

~/Devel/redqueen/projects/icicles/.venv/lib/python3.9/site-packages/modin/pandas/indexing.py in __setitem__(self, row_lookup, col_lookup, item, axis)
    315         # should be handled in a fastpath with `df[col] = item`.
    316         if axis == 0:
--> 317             self.df[self.df.columns[col_lookup][0]] = item
    318         # This is True when we are assigning to a full row. We want to reuse the setitem
    319         # mechanism to operate along only one axis for performance reasons.

~/Devel/redqueen/projects/icicles/.venv/lib/python3.9/site-packages/pandas/core/indexes/base.py in __getitem__(self, key)
   4614             key = np.asarray(key, dtype=bool)
   4615 
-> 4616         result = getitem(key)
   4617         if not is_scalar(result):
   4618             # error: Argument 1 to "ndim" has incompatible type "Union[ExtensionArray,

IndexError: index -1 is out of bounds for axis 0 with size 0

UPDATE

Changing: converted_obj = pd.DataFrame() to converted_obj = df_obj.copy() works.
But not exactly optimal for big data.
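
Another option that avoids both the empty DataFrame and the up-front copy is to collect the converted columns in a dict and construct the frame at the end. This is a sketch under assumptions: the function name convert_low_cardinality and the threshold parameter are hypothetical, the code is written against plain pandas, and it was not tested against modin[ray] 0.12.1.

```python
import pandas as pd  # assumption: swap in `import modin.pandas as pd` to try this with Modin


def convert_low_cardinality(df_obj, threshold=0.5):
    """Return a new frame where columns with few distinct values become categoricals."""
    converted = {}
    for col in df_obj.columns:
        num_unique_values = len(df_obj[col].unique())
        num_total_values = len(df_obj[col])
        # Guard against division by zero for a frame with no rows.
        if num_total_values and num_unique_values / num_total_values < threshold:
            converted[col] = df_obj[col].astype('category')
        else:
            converted[col] = df_obj[col]
    # Building from a dict of Series keeps each column's dtype.
    return pd.DataFrame(converted)


sample = pd.DataFrame({'a': ['x', 'x', 'x', 'x', 'y'], 'b': [1, 2, 3, 4, 5]})
result = convert_low_cardinality(sample)
print(result.dtypes)
```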

@devin-petersohn
Collaborator

Thanks @bstivers for confirming the issue! One minor comment:

"Changing: converted_obj = pd.DataFrame() to converted_obj = df_obj.copy() works.
But not exactly optimal for big data."

Modin doesn't copy in the same way as pandas: its memory model is copy-on-write, which means that no physical copy is created until you write to the objects themselves. In the workflow you show, I would expect the memory consumption to be the same as if the empty dataframe case worked.

The fix for this issue is slated for the next release. Thanks for the nicely reproducible examples, they really help us when trying to fix these cases!
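
The observable side of this can be sketched as follows: a write to the copy must not leak into the original, regardless of when the underlying data is physically duplicated. The snippet uses plain pandas as an assumption and only illustrates the semantics; it does not measure memory use or show Modin's internals.

```python
import pandas as pd  # assumption: plain pandas; swap in modin.pandas to compare

original = pd.DataFrame({'name': ['frozenwhale'], 'id': ['2001']})
duplicate = original.copy()

# Writing to the copy leaves the original untouched.
duplicate.loc[0, 'id'] = '9999'

print(original.loc[0, 'id'])
print(duplicate.loc[0, 'id'])
```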

@RehanSD
Collaborator

RehanSD commented Jan 13, 2022

Thank you @bstivers for opening this issue! This issue falls under the same problem as #3764, and PR #3765 is currently in progress, which should resolve the bug here and in #3764!

@jeffreykennethli jeffreykennethli self-assigned this Jan 14, 2022
@jeffreykennethli jeffreykennethli removed their assignment Feb 1, 2022
@mvashishtha
Collaborator

Duplicate of #3764

@mvashishtha mvashishtha marked this as a duplicate of #3764 Sep 6, 2022