[BUG] when using id_vars in .melt() , the string of the column name is broken into characters #15758

Closed
taureandyernv opened this issue May 15, 2024 · 3 comments · Fixed by #15765
Labels
bug Something isn't working

Comments

@taureandyernv
Contributor

Describe the bug
When I try to melt a DataFrame and pass id_vars as a single column name (a string such as "index"), the string is split into characters and I get the following error: KeyError: "The following 'id_vars' are not present in the DataFrame: ['e', 'x', 'n', 'd', 'i']"

Steps/Code to reproduce bug

import cudf
data = {
    'A': [1, None, 3],
    'B': [None, 5, 6],
    'C': [7, 8, None]
}
df = cudf.DataFrame(data)

# Reset the index to retain it
df_reset = df.reset_index()

# Melt the DataFrame while retaining the original index
melted_df = df_reset.melt(id_vars='index', var_name='column', value_name='value')

# Drop rows with NaN values
melted_df = melted_df.dropna()

# Set the original index back
melted_df = melted_df.set_index('index')

print(melted_df)

Outputs:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[9], line 13
     10 df_reset = df.reset_index()
     12 # Melt the DataFrame while retaining the original index
---> 13 melted_df = df_reset.melt(id_vars='index', var_name='column', value_name='value')
     15 # Drop rows with NaN values
     16 melted_df = melted_df.dropna()

File /opt/conda/lib/python3.10/site-packages/nvtx/nvtx.py:116, in annotate.__call__.<locals>.inner(*args, **kwargs)
    113 @wraps(func)
    114 def inner(*args, **kwargs):
    115     libnvtx_push_range(self.attributes, self.domain.handle)
--> 116     result = func(*args, **kwargs)
    117     libnvtx_pop_range(self.domain.handle)
    118     return result

File /opt/conda/lib/python3.10/site-packages/cudf/core/dataframe.py:4077, in DataFrame.melt(self, **kwargs)
   4051 """Unpivots a DataFrame from wide format to long format,
   4052 optionally leaving identifier variables set.
   4053 
   (...)
   4073     Melted result
   4074 """
   4075 from cudf.core.reshape import melt
-> 4077 return melt(self, **kwargs)

File /opt/conda/lib/python3.10/site-packages/cudf/core/reshape.py:532, in melt(frame, id_vars, value_vars, var_name, value_name, col_level)
    530     missing = set(id_vars) - set(frame._column_names)
    531     if not len(missing) == 0:
--> 532         raise KeyError(
    533             f"The following 'id_vars' are not present"
    534             f" in the DataFrame: {list(missing)}"
    535         )
    536 else:
    537     id_vars = []

KeyError: "The following 'id_vars' are not present in the DataFrame: ['e', 'x', 'n', 'd', 'i']"

Expected behavior

import pandas as pd
data = {
    'A': [1, None, 3],
    'B': [None, 5, 6],
    'C': [7, 8, None]
}
df = pd.DataFrame(data)

# Reset the index to retain it
df_reset = df.reset_index()

# Melt the DataFrame while retaining the original index
melted_df = df_reset.melt(id_vars='index', var_name='column', value_name='value')

# Drop rows with NaN values
melted_df = melted_df.dropna()

# Set the original index back
melted_df = melted_df.set_index('index')

print(melted_df)

Outputs:

      column  value
index              
0          A    1.0
2          A    3.0
1          B    5.0
2          B    6.0
0          C    7.0
1          C    8.0

Environment overview (please complete the following information)

  • Environment location: Docker
  • Method of cuDF install: Docker

Environment details
Please run and paste the output of the cudf/print_env.sh script here, to gather any other relevant environment details

Additional context
I also tried giving it a numerical column id, a single character, and a dataframe column for kicks. All failed with expected or similar errors. The call doesn't fail when using cudf.pandas, but the fallback to pandas dramatically slows cudf.pandas down, to the point where it negates many of the speedups in the workflow.

taureandyernv added the bug (Something isn't working) label May 15, 2024
@ayushdg
Member

ayushdg commented May 15, 2024

Looks like the issue is that the melt API expects id_vars to be a list, tuple, or ndarray, and the check here fails to handle the case where a single string is passed in.
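To make the failure mode concrete, here is a minimal pure-Python sketch of the membership check visible in the traceback (reshape.py, line 530). The function name missing_id_vars and the plain list column_names are stand-ins introduced only for illustration; they are not part of cuDF.

```python
# Stand-in for frame._column_names after reset_index() in the report above.
column_names = ["index", "A", "B", "C"]


def missing_id_vars(id_vars, column_names):
    """Hypothetical helper mimicking `set(id_vars) - set(frame._column_names)`."""
    return set(id_vars) - set(column_names)


# A bare string is iterated character by character, so no "label" matches.
missing_from_string = missing_id_vars("index", column_names)

# A one-element list is iterated label by label, so the check passes.
missing_from_list = missing_id_vars(["index"], column_names)

print(sorted(missing_from_string))  # ['d', 'e', 'i', 'n', 'x']
print(missing_from_list)            # set()
```

Calling set() on a string yields its individual characters, which is why the error message lists 'e', 'x', 'n', 'd', 'i' rather than 'index'.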

As a workaround, passing id_vars as a list will give the expected result: melted_df = df_reset.melt(id_vars=['index'], var_name='column', value_name='value')
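If the same code has to run on cuDF versions with and without the fix, one option is to normalize id_vars before calling melt. The sketch below is only an illustration of that idea: normalize_id_vars is a hypothetical helper name, and this is not the change made in #15765.

```python
def normalize_id_vars(id_vars):
    """Hypothetical helper: wrap a single label in a list, pass sequences through."""
    if id_vars is None:
        return []
    # Strings are iterable, so they are checked before generic iterables.
    if isinstance(id_vars, (str, bytes)) or not hasattr(id_vars, "__iter__"):
        return [id_vars]
    # Note: a tuple is treated as a sequence of labels, not as one label.
    return list(id_vars)


print(normalize_id_vars("index"))     # ['index']
print(normalize_id_vars(["index"]))   # ['index']
print(normalize_id_vars(0))           # [0]
```

With the DataFrame from the report, the call would then read df_reset.melt(id_vars=normalize_id_vars('index'), var_name='column', value_name='value').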

@mroeschke
Contributor

Thanks for the report. I have a PR to fix this issue (#15765), and the fix should hopefully land in 24.06.

rapids-bot bot pushed a commit that referenced this issue May 16, 2024
closes #15758

Also fixes an inconsistency where the `var_name` data was always a `Categorical`, unlike in pandas

Authors:
  - Matthew Roeschke (https://github.com/mroeschke)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #15765
github-project-automation bot moved this from In Progress to Done in cuDF/Dask/Numba/UCX May 16, 2024
@Saintfemi

import yfinance as yf  # assumed: `yf` is the yfinance package
from pandas import DataFrame

# tickers, start_date and end_date are defined earlier in the notebook (not shown)
data = yf.download(tickers, start=start_date, end=end_date, progress=False)

# Reset index to bring Date into the columns for the melt function
data2 = DataFrame(data).reset_index()

print(data2.columns)

data_types = data2.dtypes

data_melted = data2.melt(id_vars='Date')

data_melted

Cell Outputs
MultiIndex([(     'Date',            ''),
            ('Adj Close', 'HDFCBANK.NS'),
            ('Adj Close',     'INFY.NS'),
            ('Adj Close', 'RELIANCE.NS'),
            ('Adj Close',      'TCS.NS'),
            (    'Close', 'HDFCBANK.NS'),
            (    'Close',     'INFY.NS'),
            (    'Close', 'RELIANCE.NS'),
            (    'Close',      'TCS.NS'),
            (     'High', 'HDFCBANK.NS'),
            (     'High',     'INFY.NS'),
            (     'High', 'RELIANCE.NS'),
            (     'High',      'TCS.NS'),
            (      'Low', 'HDFCBANK.NS'),
            (      'Low',     'INFY.NS'),
            (      'Low', 'RELIANCE.NS'),
            (      'Low',      'TCS.NS'),
            (     'Open', 'HDFCBANK.NS'),
            (     'Open',     'INFY.NS'),
            (     'Open', 'RELIANCE.NS'),
            (     'Open',      'TCS.NS'),
            (   'Volume', 'HDFCBANK.NS'),
            (   'Volume',     'INFY.NS'),
            (   'Volume', 'RELIANCE.NS'),
            (   'Volume',      'TCS.NS')],
           names=['Price', 'Ticker'])

{
"name": "KeyError",
"message": ""The following id_vars or value_vars are not present in the DataFrame: ['Date']"",
"stack": "---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[53], line 16
11 print(data2.columns)
14 data_types =data2.dtypes
---> 16 data_melted = data2.melt(id_vars= 'Date')
18 data_melted

File c:\Users\Oluwanifemi.Amao\Documents\demo-project\.venv\Lib\site-packages\pandas\core\frame.py:9942, in DataFrame.melt(self, id_vars, value_vars, var_name, value_name, col_level, ignore_index)
9932 @appender(_shared_docs["melt"] % {"caller": "df.melt(", "other": "melt"})
9933 def melt(
9934 self,
(...)
9940 ignore_index: bool = True,
9941 ) -> DataFrame:
-> 9942 return melt(
9943 self,
9944 id_vars=id_vars,
9945 value_vars=value_vars,
9946 var_name=var_name,
9947 value_name=value_name,
9948 col_level=col_level,
9949 ignore_index=ignore_index,
9950 ).finalize(self, method="melt")

File c:\Users\Oluwanifemi.Amao\Documents\demo-project\.venv\Lib\site-packages\pandas\core\reshape\melt.py:74, in melt(frame, id_vars, value_vars, var_name, value_name, col_level, ignore_index)
70 if missing.any():
71 missing_labels = [
72 lab for lab, not_found in zip(labels, missing) if not_found
73 ]
---> 74 raise KeyError(
75 "The following id_vars or value_vars are not present in "
76 f"the DataFrame: {missing_labels}"
77 )
78 if value_vars_was_not_none:
79 frame = frame.iloc[:, algos.unique(idx)]

KeyError: "The following id_vars or value_vars are not present in the DataFrame: ['Date']""
}
