[BUG] when using id_vars in `.melt()` , the string of the column name is broken into characters #15758

taureandyernv · 2024-05-15T08:50:00Z

Describe the bug
When trying to create a melted DataFrame with id_vars set to a column name passed as a string, for example "index", I get the following error: KeyError: "The following 'id_vars' are not present in the DataFrame: ['e', 'x', 'n', 'd', 'i']"

Steps/Code to reproduce bug

import cudf
data = {
    'A': [1, None, 3],
    'B': [None, 5, 6],
    'C': [7, 8, None]
}
df = cudf.DataFrame(data)

# Reset the index to retain it
df_reset = df.reset_index()

# Melt the DataFrame while retaining the original index
melted_df = df_reset.melt(id_vars='index', var_name='column', value_name='value')

# Drop rows with NaN values
melted_df = melted_df.dropna()

# Set the original index back
melted_df = melted_df.set_index('index')

print(melted_df)

Outputs:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[9], line 13
     10 df_reset = df.reset_index()
     12 # Melt the DataFrame while retaining the original index
---> 13 melted_df = df_reset.melt(id_vars='index', var_name='column', value_name='value')
     15 # Drop rows with NaN values
     16 melted_df = melted_df.dropna()

File /opt/conda/lib/python3.10/site-packages/nvtx/nvtx.py:116, in annotate.__call__.<locals>.inner(*args, **kwargs)
    113 @wraps(func)
    114 def inner(*args, **kwargs):
    115     libnvtx_push_range(self.attributes, self.domain.handle)
--> 116     result = func(*args, **kwargs)
    117     libnvtx_pop_range(self.domain.handle)
    118     return result

File /opt/conda/lib/python3.10/site-packages/cudf/core/dataframe.py:4077, in DataFrame.melt(self, **kwargs)
   4051 """Unpivots a DataFrame from wide format to long format,
   4052 optionally leaving identifier variables set.
   4053 
   (...)
   4073     Melted result
   4074 """
   4075 from cudf.core.reshape import melt
-> 4077 return melt(self, **kwargs)

File /opt/conda/lib/python3.10/site-packages/cudf/core/reshape.py:532, in melt(frame, id_vars, value_vars, var_name, value_name, col_level)
    530     missing = set(id_vars) - set(frame._column_names)
    531     if not len(missing) == 0:
--> 532         raise KeyError(
    533             f"The following 'id_vars' are not present"
    534             f" in the DataFrame: {list(missing)}"
    535         )
    536 else:
    537     id_vars = []

KeyError: "The following 'id_vars' are not present in the DataFrame: ['e', 'x', 'n', 'd', 'i']"

Expected behavior

import pandas as pd
data = {
    'A': [1, None, 3],
    'B': [None, 5, 6],
    'C': [7, 8, None]
}
df = pd.DataFrame(data)

# Reset the index to retain it
df_reset = df.reset_index()

# Melt the DataFrame while retaining the original index
melted_df = df_reset.melt(id_vars='index', var_name='column', value_name='value')

# Drop rows with NaN values
melted_df = melted_df.dropna()

# Set the original index back
melted_df = melted_df.set_index('index')

print(melted_df)

Outputs:

      column  value
index              
0          A    1.0
2          A    3.0
1          B    5.0
2          B    6.0
0          C    7.0
1          C    8.0

Environment overview (please complete the following information)

Environment location: Docker
Method of cuDF install: Docker

Environment details
Please run and paste the output of the cudf/print_env.sh script here, to gather any other relevant environment details

Additional context
I also tried giving it a numerical column id, a single character, and a dataframe column for kicks. All failed with the expected or similar errors. While it doesn't fail when using cudf.pandas, the fallback to pandas dramatically slows down cudf.pandas, to the point where it negates many of the speedups in the workflow.

ayushdg · 2024-05-15T23:48:26Z

Looks like the issue here is that the melt API expects id_vars to be a list/tuple/ndarray type, and the check here fails to handle the case where a string is passed in.

As a workaround, passing id_vars as a list will give the expected result: melted_df = df_reset.melt(id_vars=['index'], var_name='column', value_name='value')
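To illustrate why the error lists single characters, here is a minimal pure-Python sketch of the `set(id_vars) - set(frame._column_names)` check shown in the traceback. The `normalize_id_vars` helper is hypothetical and written only for illustration; it is not cuDF's actual code or the fix that was merged.

```python
def normalize_id_vars(id_vars):
    """Hypothetical helper: wrap a scalar label in a list before validation."""
    if id_vars is None:
        return []
    # A str is iterable, so it must be treated as a scalar explicitly.
    if isinstance(id_vars, (str, bytes)) or not hasattr(id_vars, "__iter__"):
        return [id_vars]
    return list(id_vars)


# Column names of df_reset in the report above.
column_names = ["index", "A", "B", "C"]

# Without normalization, set("index") iterates the string character by character.
missing_raw = set("index") - set(column_names)

# With normalization, the whole label is compared against the column names.
missing_fixed = set(normalize_id_vars("index")) - set(column_names)

print(sorted(missing_raw))
print(missing_fixed)
```

Passing `id_vars=['index']` sidesteps the problem for the same reason: the list already contains the whole label as a single element.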

mroeschke · 2024-05-16T01:54:54Z

Thanks for the report. I have a PR to fix this issue (#15765), and it should hopefully be fixed in 24.06.

closes #15758

Also fixes an inconsistency with pandas where `var_name` data was always a `Categorical`, unlike pandas.

Authors:
- Matthew Roeschke (https://github.com/mroeschke)
- GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #15765

Saintfemi · 2024-08-19T09:06:44Z

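import yfinance as yf  # assumed: the snippet uses the alias yf without importing it
from pandas import DataFrame

# tickers, start_date, and end_date are not shown in this snippet
data = yf.download(tickers, start=start_date, end=end_date, progress=False)

# reset index to bring Date into the columns for the melt function
data2 = DataFrame(data).reset_index()

print(data2.columns)

data_types = data2.dtypes

data_melted = data2.melt(id_vars='Date')

data_melted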

Cell Outputs
MultiIndex([( 'Date', ''),
('Adj Close', 'HDFCBANK.NS'),
('Adj Close', 'INFY.NS'),
('Adj Close', 'RELIANCE.NS'),
('Adj Close', 'TCS.NS'),
( 'Close', 'HDFCBANK.NS'),
( 'Close', 'INFY.NS'),
( 'Close', 'RELIANCE.NS'),
( 'Close', 'TCS.NS'),
( 'High', 'HDFCBANK.NS'),
( 'High', 'INFY.NS'),
( 'High', 'RELIANCE.NS'),
( 'High', 'TCS.NS'),
( 'Low', 'HDFCBANK.NS'),
( 'Low', 'INFY.NS'),
( 'Low', 'RELIANCE.NS'),
( 'Low', 'TCS.NS'),
( 'Open', 'HDFCBANK.NS'),
( 'Open', 'INFY.NS'),
( 'Open', 'RELIANCE.NS'),
( 'Open', 'TCS.NS'),
( 'Volume', 'HDFCBANK.NS'),
( 'Volume', 'INFY.NS'),
( 'Volume', 'RELIANCE.NS'),
( 'Volume', 'TCS.NS')],
names=['Price', 'Ticker'])

{
"name": "KeyError",
"message": ""The following id_vars or value_vars are not present in the DataFrame: ['Date']"",
"stack": "---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[53], line 16
11 print(data2.columns)
14 data_types =data2.dtypes
---> 16 data_melted = data2.melt(id_vars= 'Date')
18 data_melted

File c:\Users\Oluwanifemi.Amao\Documents\demo-project\.venv\Lib\site-packages\pandas\core\frame.py:9942, in DataFrame.melt(self, id_vars, value_vars, var_name, value_name, col_level, ignore_index)
9932 @appender(_shared_docs["melt"] % {"caller": "df.melt(", "other": "melt"})
9933 def melt(
9934 self,
(...)
9940 ignore_index: bool = True,
9941 ) -> DataFrame:
-> 9942 return melt(
9943 self,
9944 id_vars=id_vars,
9945 value_vars=value_vars,
9946 var_name=var_name,
9947 value_name=value_name,
9948 col_level=col_level,
9949 ignore_index=ignore_index,
9950 ).finalize(self, method="melt")

File c:\Users\Oluwanifemi.Amao\Documents\demo-project\.venv\Lib\site-packages\pandas\core\reshape\melt.py:74, in melt(frame, id_vars, value_vars, var_name, value_name, col_level, ignore_index)
70 if missing.any():
71 missing_labels = [
72 lab for lab, not_found in zip(labels, missing) if not_found
73 ]
---> 74 raise KeyError(
75 "The following id_vars or value_vars are not present in "
76 f"the DataFrame: {missing_labels}"
77 )
78 if value_vars_was_not_none:
79 frame = frame.iloc[:, algos.unique(idx)]

KeyError: "The following id_vars or value_vars are not present in the DataFrame: ['Date']""
}
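The printed columns above are a MultiIndex, so the date column's label is the tuple ('Date', '') rather than the string 'Date'. That is likely why pandas reports 'Date' as missing, and it appears to be unrelated to the cuDF bug in this issue. One possible approach, sketched below on a small made-up frame with the same column layout, is to flatten the column labels to plain strings before melting. The sample values and the underscore-joined names are assumptions for illustration.

```python
import pandas as pd

# Made-up frame mimicking the two-level columns printed above.
columns = pd.MultiIndex.from_tuples(
    [("Date", ""), ("Close", "INFY.NS"), ("Close", "TCS.NS")],
    names=["Price", "Ticker"],
)
wide = pd.DataFrame(
    [
        [pd.Timestamp("2024-01-01"), 10.0, 20.0],
        [pd.Timestamp("2024-01-02"), 11.0, 21.0],
    ],
    columns=columns,
)

# Join the non-empty parts of each tuple label into a single string,
# so ("Date", "") becomes "Date" and ("Close", "INFY.NS") becomes "Close_INFY.NS".
flat = wide.copy()
flat.columns = ["_".join(part for part in col if part) for col in flat.columns]

# With flat string labels, a scalar id_vars refers to an existing column.
long = flat.melt(id_vars="Date")
print(long)
```

Flattening loses the separate Price and Ticker levels; if those are needed as distinct columns, the joined `variable` column can be split again afterwards.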

taureandyernv added the bug Something isn't working label May 15, 2024

github-project-automation bot added this to cuDF/Dask/Numba/UCX May 15, 2024

github-project-automation bot moved this to In Progress in cuDF/Dask/Numba/UCX May 15, 2024

mroeschke mentioned this issue May 16, 2024: Fix id_vars and value_vars not accepting string scalars in melt #15765 (Merged)

rapids-bot bot closed this as completed in #15765 May 16, 2024

github-project-automation bot moved this from In Progress to Done in cuDF/Dask/Numba/UCX May 16, 2024
