.loc can't work effectively when I try to append a list to dataframe #2953

ifrozenwhale · 2021-04-01T19:17:04Z

System information

Linux Ubuntu: 20.04
Modin version: 0.9.1
Python version: 3.8
Code we can use to reproduce:

import modin.pandas as pd


def list2DataFrame(columns, values):
    test_df = pd.DataFrame(columns=columns)
    
    test_df.loc[len(test_df)] = values
    return test_df

if __name__ == '__main__':
    columns = ['name', 'id']
    values = ['frozenwhale','2001']
    df = list2DataFrame(columns, values)
    print(df)

Describe the problem

When I try to append a list as a new row to an empty dataframe with .loc, Modin raises an IndexError:

Traceback (most recent call last):
  File "test_bug.py", line 14, in <module>
    df = list2DataFrame(columns, values)
  File "test_bug.py", line 8, in list2DataFrame
    test_df.loc[len(test_df)] = values
  File "/usr/local/lib/python3.8/dist-packages/modin/pandas/indexing.py", line 594, in __setitem__
    self.qc = self.qc.reindex(labels=index, axis=0)
  File "/usr/local/lib/python3.8/dist-packages/modin/backends/pandas/query_compiler.py", line 521, in reindex
    new_modin_frame = self._modin_frame._apply_full_axis(
  File "/usr/local/lib/python3.8/dist-packages/modin/engines/base/frame/data.py", line 1302, in _apply_full_axis
    return self.broadcast_apply_full_axis(
  File "/usr/local/lib/python3.8/dist-packages/modin/engines/base/frame/data.py", line 1706, in broadcast_apply_full_axis
    result = self.__constructor__(
  File "/usr/local/lib/python3.8/dist-packages/modin/engines/base/frame/data.py", line 82, in __init__
    self._filter_empties()
  File "/usr/local/lib/python3.8/dist-packages/modin/engines/base/frame/data.py", line 282, in _filter_empties
    self._column_widths_cache = [w for w in self._column_widths if w != 0]
  File "/usr/local/lib/python3.8/dist-packages/modin/engines/dask/pandas_on_dask/frame/data.py", line 50, in _column_widths
    for obj in self._partitions[0]
IndexError: index 0 is out of bounds for axis 0 with size 0
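The last frame of the traceback evaluates `self._partitions[0]`, and the message has the form NumPy uses for an out-of-range index. A minimal sketch of that failure mode, assuming the empty frame is backed by a NumPy array with zero rows (the name `partitions` below is illustrative, not Modin's code):

```python
import numpy as np

# Assumption: an empty frame is backed by a partition grid with zero rows.
partitions = np.empty((0, 0), dtype=object)

try:
    first_row = partitions[0]  # same access pattern as `self._partitions[0]`
except IndexError as exc:
    message = str(exc)
    print(message)
```

If that assumption holds, any code path that reads the first partition row without first checking for an empty grid would fail this way.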

Source code / logs

The same code works with pandas.
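For reference, this is the behavior I expect, shown with plain pandas. Assigning to a row label that does not exist yet enlarges the frame, so the list becomes row 0. The variable name `expected` is only for illustration:

```python
import pandas as pd

columns = ["name", "id"]
values = ["frozenwhale", "2001"]

expected = pd.DataFrame(columns=columns)
# len(expected) is 0, so this assigns to the new label 0 and adds one row.
expected.loc[len(expected)] = values
print(expected)
```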

devin-petersohn · 2021-04-05T13:49:29Z

Hi @ifrozenwhale, thanks for the report!

This looks like an issue with empty dataframes: I cannot reproduce it when the dataframe has content, only when it is empty. We are actively working on improving the handling of empty dataframes. Thanks again for the report, we will get this fixed!
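Since the failure only shows up when the frame is empty, one possible workaround is to skip the empty intermediate frame and pass the rows to the constructor directly. This is a sketch shown with plain pandas. It assumes, without having been checked against Modin 0.9.1, that `modin.pandas` accepts the same constructor call. The helper name `rows_to_dataframe` is hypothetical:

```python
import pandas as pd  # assumption: `import modin.pandas as pd` accepts the same call


def rows_to_dataframe(columns, rows):
    """Build the frame from all rows at once instead of appending to an empty one."""
    return pd.DataFrame(rows, columns=columns)


df = rows_to_dataframe(["name", "id"], [["frozenwhale", "2001"]])
print(df)
```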

bstivers · 2022-01-09T23:32:59Z

I think I have encountered the same (or a similar) issue, but with IndexError: index -1 is out of bounds for axis 0 with size 0. It also starts from an empty DataFrame.

I'm unsure whether I should open a new issue. The behavior is the same: it works with pandas but not with modin[ray].

Disclaimer: I have not tried to run this on a duplicated DataFrame (yet).

EDIT: df.copy() does work.

Environment

[tool.poetry.dependencies]
python = "3.9.9"
modin = { extras = ["ray"], version = "0.12.1" }
notebook = "6.4.6"

Full context

Tutorial: Using Pandas with Large Data Sets in Python

Issue

converted_obj = pd.DataFrame()   # DOES NOT WORK
# converted_obj = df_obj.copy()   # WORKS

for col in df_obj.columns:
    num_unique_values = len(df_obj[col].unique())
    num_total_values = len(df_obj[col])
    if num_unique_values / num_total_values < 0.5:
        converted_obj.loc[:,col] = df_obj[col].astype('category')
    else:
        converted_obj.loc[:,col] = df_obj[col]

IndexError                                Traceback (most recent call last)
/tmp/ipykernel_689608/3414656741.py in <module>
      5     num_total_values = len(df_obj[col])
      6     if num_unique_values / num_total_values < 0.5:
----> 7         converted_obj.loc[:,col] = df_obj[col].astype('category')
      8     else:
      9         converted_obj.loc[:,col] = df_obj[col]

~/Devel/redqueen/projects/icicles/.venv/lib/python3.9/site-packages/modin/pandas/indexing.py in __setitem__(self, key, item)
    659         else:
    660             row_lookup, col_lookup = self._compute_lookup(row_loc, col_loc)
--> 661             super(_LocIndexer, self).__setitem__(
    662                 row_lookup,
    663                 col_lookup,

~/Devel/redqueen/projects/icicles/.venv/lib/python3.9/site-packages/modin/pandas/indexing.py in __setitem__(self, row_lookup, col_lookup, item, axis)
    315         # should be handled in a fastpath with `df[col] = item`.
    316         if axis == 0:
--> 317             self.df[self.df.columns[col_lookup][0]] = item
    318         # This is True when we are assigning to a full row. We want to reuse the setitem
    319         # mechanism to operate along only one axis for performance reasons.

~/Devel/redqueen/projects/icicles/.venv/lib/python3.9/site-packages/pandas/core/indexes/base.py in __getitem__(self, key)
   4614             key = np.asarray(key, dtype=bool)
   4615 
-> 4616         result = getitem(key)
   4617         if not is_scalar(result):
   4618             # error: Argument 1 to "ndim" has incompatible type "Union[ExtensionArray,

IndexError: index -1 is out of bounds for axis 0 with size 0

UPDATE

Changing: converted_obj = pd.DataFrame() to converted_obj = df_obj.copy() works.
But not exactly optimal for big data.
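Another option that avoids both the empty frame and the full copy is to collect the converted columns in a dict and build the frame once at the end. This is a sketch shown with plain pandas, on the untested assumption that it carries over to modin[ray]. The small `df_obj` below is a made-up stand-in for the tutorial data, and `converted_cols` is a name introduced here:

```python
import pandas as pd  # assumption: swap for `import modin.pandas as pd` in the real workflow

# Hypothetical stand-in for `df_obj`, the string columns from the tutorial.
df_obj = pd.DataFrame({
    "team": ["a", "a", "a", "b", "b", "b"],  # 2 unique / 6 total -> category
    "code": ["u", "v", "w", "x", "y", "z"],  # all unique -> left unchanged
})

converted_cols = {}
for col in df_obj.columns:
    num_unique_values = len(df_obj[col].unique())
    num_total_values = len(df_obj[col])
    if num_unique_values / num_total_values < 0.5:
        converted_cols[col] = df_obj[col].astype("category")
    else:
        converted_cols[col] = df_obj[col]

# Build the result in one step, so nothing is ever assigned into an empty frame.
converted_obj = pd.DataFrame(converted_cols)
print(converted_obj.dtypes)
```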

devin-petersohn · 2022-01-13T15:44:08Z

Thanks @bstivers for confirming the issue! One minor comment:

Changing: converted_obj = pd.DataFrame() to converted_obj = df_obj.copy() works.
But not exactly optimal for big data.

Modin doesn't copy the way pandas does: its memory structure uses copy-on-write, so no physical copy is created until you write to the objects themselves. In the workflow you show, I would expect memory consumption to be the same as if the empty dataframe case worked.
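For readers who have not met copy-on-write, here is a toy illustration of the general idea in plain Python. It is not Modin's implementation, and the `CowBuffer` class and its methods are invented for this sketch: `copy()` shares the underlying data, and the real copy happens only on the first write.

```python
class CowBuffer:
    """Toy copy-on-write wrapper around a list (illustrative only, not Modin's code)."""

    def __init__(self, data, shared=False):
        self._data = data
        self._shared = shared

    def copy(self):
        # No physical copy yet: both objects point at the same list.
        self._shared = True
        return CowBuffer(self._data, shared=True)

    def __setitem__(self, index, value):
        if self._shared:
            # First write: materialize a private copy before mutating.
            self._data = list(self._data)
            self._shared = False
        self._data[index] = value

    def __getitem__(self, index):
        return self._data[index]

    def shares_memory_with(self, other):
        return self._data is other._data


original = CowBuffer([1, 2, 3])
duplicate = original.copy()
shared_before_write = duplicate.shares_memory_with(original)
duplicate[0] = 99
shared_after_write = duplicate.shares_memory_with(original)
print(shared_before_write, shared_after_write, original[0], duplicate[0])
```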

The fix for this issue is slated for the next release. Thanks for the nicely reproducible examples, they really help us when trying to fix these cases!

RehanSD · 2022-01-13T23:31:29Z

Thank you @bstivers for opening this issue! This issue falls under the same problem as #3764, and PR #3765 is currently in progress, which should resolve the bug here and in #3764!

mvashishtha · 2022-09-06T17:13:09Z

Duplicate of #3764

ifrozenwhale added the bug 🦗 label Apr 1, 2021

devin-petersohn added this to the bugs and regressions milestone Apr 5, 2021

jeffreykennethli self-assigned this Jan 14, 2022

devin-petersohn assigned RehanSD Jan 18, 2022

jeffreykennethli removed their assignment Feb 1, 2022

mvashishtha marked this as a duplicate of #3764 Sep 6, 2022

mvashishtha closed this as completed Sep 6, 2022
