[FEA] Implement merge for dataframes with decimal columns #7497

Closed
ChrisJar opened this issue Mar 3, 2021 · 2 comments · Fixed by #7764
Labels: feature request (New feature or request), Python (Affects Python cuDF API)

ChrisJar commented Mar 3, 2021

Is your feature request related to a problem? Please describe.
I would like to be able to merge DataFrames that contain decimal columns.

Describe the solution you'd like
I would like to merge DataFrames with decimal columns the way I can currently merge DataFrames with float columns. For example:

import cudf

df1 = cudf.DataFrame({'id': [0, 1, 1], 'val': [1.00, 1.01, 1.02]})
df2 = cudf.DataFrame({'id': [0, 1, 1], 'val': [8.28, 9.32, 4.94]})
df1.merge(df2, left_on=['id'], right_on=['id'], how='inner')

returns

	id	val_x	val_y
0	0	1.00	8.28
1	1	1.01	9.32
2	1	1.02	9.32
3	1	1.01	4.94
4	1	1.02	4.94
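
An inner join matches every left row with every right row that shares its key. A minimal plain-Python sketch of that rule, assuming rows are stored as `(id, val)` tuples and using a hypothetical helper `inner_join` (not a cuDF function), can be used to check which `(val_x, val_y)` pairs the merge should contain. Row order here may differ from cuDF's output:

```python
from collections import defaultdict


def inner_join(left, right):
    """Hash-join two lists of (id, val) rows on id; returns (id, val_x, val_y) rows."""
    # Index the right-hand rows by key so each left row can look up its matches.
    right_by_id = defaultdict(list)
    for key, val in right:
        right_by_id[key].append(val)

    # Emit one output row per matching (left, right) pair.
    return [
        (key, val_x, val_y)
        for key, val_x in left
        for val_y in right_by_id.get(key, [])
    ]


left = [(0, 1.00), (1, 1.01), (1, 1.02)]
right = [(0, 8.28), (1, 9.32), (1, 4.94)]

for row in inner_join(left, right):
    print(row)
```

The single `id == 0` row on each side yields one output row, and the two `id == 1` rows on each side yield four, for five rows in total.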

However, after replacing the float columns with decimal columns:

import decimal

df1['val'] = cudf.Series([decimal.Decimal(x) for x in [1.00, 1.01, 1.02]], dtype=cudf.Decimal64Dtype(7,3))
df2['val'] = cudf.Series([decimal.Decimal(x) for x in [8.28, 9.32, 4.94]], dtype=cudf.Decimal64Dtype(7,3))
df1.merge(df2, left_on=['id'], right_on=['id'], how='inner')

It raises a TypeError:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-19-6a6f615495ed> in <module>
----> 1 df1.merge(df2, left_on=['id'], right_on=['id'], how='inner')

/home/u00u7rh1e72hXfsipJ357/miniconda3/envs/rapids-gpu-bdb/lib/python3.7/contextlib.py in inner(*args, **kwds)
     72         def inner(*args, **kwds):
     73             with self._recreate_cm():
---> 74                 return func(*args, **kwds)
     75         return inner
     76 

/home/u00u7rh1e72hXfsipJ357/miniconda3/envs/rapids-gpu-bdb/lib/python3.7/site-packages/cudf/core/dataframe.py in merge(self, right, on, left_on, right_on, left_index, right_index, how, sort, lsuffix, rsuffix, method, indicator, suffixes)
   4239             method=method,
   4240             indicator=indicator,
-> 4241             suffixes=suffixes,
   4242         )
   4243         return gdf_result

/home/u00u7rh1e72hXfsipJ357/miniconda3/envs/rapids-gpu-bdb/lib/python3.7/site-packages/cudf/core/frame.py in _merge(self, right, on, left_on, right_on, left_index, right_index, how, sort, lsuffix, rsuffix, method, indicator, suffixes)
   3434             suffixes,
   3435         )
-> 3436         to_return = mergeop.perform_merge()
   3437 
   3438         # If sort=True, Pandas would sort on the key columns in the

/home/u00u7rh1e72hXfsipJ357/miniconda3/envs/rapids-gpu-bdb/lib/python3.7/site-packages/cudf/core/join/join.py in perform_merge(self)
    119         )
    120         result = self.out_class._from_table(libcudf_result)
--> 121         result = self.typecast_libcudf_to_output(result, output_dtypes)
    122         if isinstance(result, cudf.Index):
    123             return result

/home/u00u7rh1e72hXfsipJ357/miniconda3/envs/rapids-gpu-bdb/lib/python3.7/site-packages/cudf/core/join/join.py in typecast_libcudf_to_output(self, output, output_dtypes)
    412             if data_dtype:
    413                 output._data[data_col_lbl] = self._build_output_col(
--> 414                     data_col, data_dtype
    415                 )
    416         return output

/home/u00u7rh1e72hXfsipJ357/miniconda3/envs/rapids-gpu-bdb/lib/python3.7/site-packages/cudf/core/join/join.py in _build_output_col(self, col, dtype)
    427             )
    428         else:
--> 429             outcol = col.astype(dtype)
    430         return outcol

/home/u00u7rh1e72hXfsipJ357/miniconda3/envs/rapids-gpu-bdb/lib/python3.7/site-packages/cudf/core/column/column.py in astype(self, dtype, **kwargs)
   1027                 )
   1028             return self
-> 1029         elif np.issubdtype(dtype, np.datetime64):
   1030             return self.as_datetime_column(dtype, **kwargs)
   1031         elif np.issubdtype(dtype, np.timedelta64):

/home/u00u7rh1e72hXfsipJ357/miniconda3/envs/rapids-gpu-bdb/lib/python3.7/site-packages/numpy/core/numerictypes.py in issubdtype(arg1, arg2)
    386     """
    387     if not issubclass_(arg1, generic):
--> 388         arg1 = dtype(arg1).type
    389     if not issubclass_(arg2, generic):
    390         arg2 = dtype(arg2).type

TypeError: Cannot interpret 'Decimal64Dtype(precision=7, scale=3)' as a data type
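
The last frame points at the likely root cause: `np.issubdtype` passes its first argument to `np.dtype(...)`, which cannot interpret an object that is not a NumPy dtype. The sketch below reproduces that behaviour without a GPU, using a hypothetical stand-in class `FakeDecimalDtype` in place of `cudf.Decimal64Dtype`, and shows one possible guard (checking for the non-NumPy dtype before any NumPy check). It only illustrates the failure mode and is not the change made in cuDF:

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class FakeDecimalDtype:
    """Hypothetical stand-in for cudf.Decimal64Dtype (not a NumPy dtype)."""
    precision: int
    scale: int


def is_datetime_dtype(dtype):
    """Return True for NumPy datetime64 dtypes, False for the stand-in decimal dtype."""
    # Guard: handle the non-NumPy dtype before handing it to NumPy.
    if isinstance(dtype, FakeDecimalDtype):
        return False
    return np.issubdtype(dtype, np.datetime64)


dec = FakeDecimalDtype(precision=7, scale=3)

# Without the guard, NumPy raises TypeError because it cannot build a dtype from `dec`.
try:
    np.issubdtype(dec, np.datetime64)
    raised = False
except TypeError:
    raised = True

print(raised)
print(is_datetime_dtype(dec))
print(is_datetime_dtype(np.dtype("datetime64[ns]")))
```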
@ChrisJar ChrisJar added Needs Triage Need team to review and classify feature request New feature or request labels Mar 3, 2021
@kkraus14 kkraus14 added Python Affects Python cuDF API. and removed Needs Triage Need team to review and classify labels Mar 3, 2021

harrism commented Mar 16, 2021

This already works in libcudf, right @codereport ?

codereport commented

> This already works in libcudf, right @codereport ?

Yes, libcudf supports this.

@rapids-bot rapids-bot bot closed this as completed in #7764 Apr 2, 2021
rapids-bot bot pushed a commit that referenced this issue Apr 2, 2021
This enables joins on decimal columns with the same precision and scale.

Closes #7497 
Depends on #7788

Authors:
  - https://github.com/ChrisJar

Approvers:
  - Keith Kraus (https://github.com/kkraus14)
  - Ashwin Srinath (https://github.com/shwina)
  - https://github.com/brandon-b-miller
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #7764
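
The commit message limits the feature to decimal columns "with the same precision and scale". A minimal sketch of what such a compatibility check could look like, assuming the restriction applies to the columns being joined on, and using a hypothetical `DecimalType` record and helper `check_joinable` (neither is a cuDF API, and this is not the code from #7764):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecimalType:
    """Hypothetical description of a fixed-point column type."""
    precision: int  # total number of digits
    scale: int      # digits to the right of the decimal point


def check_joinable(left, right):
    """Raise TypeError unless both key types share precision and scale."""
    if left != right:
        raise TypeError(
            f"cannot join {left} with {right}: precision and scale must match"
        )


check_joinable(DecimalType(7, 3), DecimalType(7, 3))  # same type: accepted

try:
    check_joinable(DecimalType(7, 3), DecimalType(9, 2))
    mismatch_rejected = False
except TypeError:
    mismatch_rejected = True

print(mismatch_rejected)
```

Rejecting mismatched types up front keeps the sketch simple. Whether a real implementation rejects or casts in that case is not stated in this thread.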
shwina pushed a commit to shwina/cudf that referenced this issue Apr 7, 2021