[FEA] Month addition or subtraction is inaccurate #6754

Closed
roe246 opened this issue Nov 12, 2020 · 6 comments · Fixed by #6775
Labels: feature request (New feature or request), Python (Affects Python cuDF API)

Comments

@roe246

roe246 commented Nov 12, 2020

I wish cuDF could add or subtract months accurately, accounting for the fact that a month can have 28, 29, 30, or 31 days.

The ideal feature would take a datetime column and add or subtract any number of months to produce a new column, in the cleanest and simplest way to code and run.

For example,
DF = {'id': ['a','b','c'], 'old_date': ['2019-11-01', '2019-12-01', '2020-01-01']}
month_add = 1
I need DF['new_date'] = DF['old_date'] + month_add
so
DF = {'id': ['a','b','c'], 'old_date': ['2019-11-01', '2019-12-01', '2020-01-01'], 'new_date': ['2019-12-01', '2020-01-01', '2020-02-01']}

As a workaround, I have to convert the datetime to a string, handle the year and month separately, and then reassemble the result. It takes a lot of extra time to split the data into single-digit-month and double-digit-month dataframes, process each into the correct datetime format, and append the dfs back together. Also, single-digit and double-digit months cannot be computed and concatenated uniformly into YYYY-MM-DD format: the result comes out as ‘2020-1-01’ rather than the correct ‘2020-01-01’.

Pain points -

  1. np.timedelta(month=n) does not account for months having 28, 29, 30, or 31 days; it adds a month as the average number of days per month, which is a limitation of numpy datetime arithmetic

  2. dateutil.relativedelta(months=+n) does not work with RAPIDS because of an issue broadcasting this specific package/function

  3. Calculating ‘YYYY’ and ‘MM’ separately and concatenating the strings back into ‘YYYY-MM-01’ turns ‘MM’ into ‘M’ when the month is less than 10, so we had to separate the single-digit ‘M’ dfs from the double-digit ‘MM’ dfs and add the ‘0’ back to the single ‘M’ ad hoc

  4. This approach is extremely slow because it breaks the df apart and appends the pieces back together, especially when scaled up or when expanding the cuDF dataframe based on other columns
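
For readers following along, here is a minimal CPU-only sketch of the calendar-aware arithmetic being requested, using only the Python standard library. The helper name `add_months` is hypothetical, and clamping the day to the end of the target month is an assumption about the desired behavior, not something stated in the request. It also shows that zero-padding the month with a format spec avoids the ‘2020-1-01’ problem from pain point 3 without splitting the data.

```python
import calendar
from datetime import date


def add_months(d: date, months: int) -> date:
    """Shift d by a whole number of calendar months (hypothetical helper).

    months may be negative. The day is clamped to the last day of the
    target month, which is an assumed policy for end-of-month dates.
    """
    # Count months from year 0 so divmod handles year rollover both ways.
    total = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(total, 12)
    month = month0 + 1
    # monthrange returns (weekday of first day, number of days in month).
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


old_dates = [date(2019, 11, 1), date(2019, 12, 1), date(2020, 1, 1)]
new_dates = [add_months(d, 1) for d in old_dates]

# isoformat always zero-pads, so single-digit months need no special case.
formatted = [d.isoformat() for d in new_dates]

# If year and month are handled separately, a format spec pads the month.
year, month = 2020, 1
padded = f"{year}-{month:02d}-01"
```

This runs row by row on the CPU, so it does not address the speed concern in pain point 4; it only pins down the expected results.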

@roe246 roe246 added Needs Triage Need team to review and classify feature request New feature or request labels Nov 12, 2020
@jrhemstad
Contributor

@beckernick
Member

I believe pandas exposes this functionality at the module level as pd.DateOffset
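
As a rough illustration of the pandas behavior referred to here, the sketch below applies `pd.DateOffset(months=...)` to the example data from the original request. It runs on the CPU with pandas only; the column names come from the request, and the variable `new_dates` is just for inspection.

```python
import pandas as pd

# Example data from the original request, parsed into a datetime column.
df = pd.DataFrame({
    "id": ["a", "b", "c"],
    "old_date": pd.to_datetime(["2019-11-01", "2019-12-01", "2020-01-01"]),
})

month_add = 1

# DateOffset with months= shifts by calendar months, not a fixed number of days.
df["new_date"] = df["old_date"] + pd.DateOffset(months=month_add)

# Format back to zero-padded YYYY-MM-DD strings for display.
new_dates = df["new_date"].dt.strftime("%Y-%m-%d").tolist()
```

Subtraction works the same way with `df["old_date"] - pd.DateOffset(months=month_add)`.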

@brandon-b-miller
Contributor

I can look into plumbing the libcudf function through here.

@brandon-b-miller brandon-b-miller self-assigned this Nov 13, 2020
@roe246
Author

roe246 commented Nov 13, 2020

> I can look into plumbing the libcudf function through here.

Thanks Brandon! So just to be clear, this C++ functionality is not available in Python, right? Our team is a quant team and does not have the skill set to look into C++, so it'd be awesome if Python could have the same functionality!

@brandon-b-miller
Contributor

Right - we'd write a cuDF Python API that'd be close (if not identical) to pandas and produce Cython bindings that call the C++ under the hood. That said, this is at the "seems like it will theoretically work" phase of development, and I have not at all scoped out what caveats there might be.

@kkraus14 kkraus14 added Python Affects Python cuDF API. and removed Needs Triage Need team to review and classify labels Nov 13, 2020
rapids-bot bot pushed a commit that referenced this issue Dec 15, 2020
Implements `cudf.DateOffset` - an object used for calendrical arithmetic, similar to pandas.DateOffset - for month units only. 

Closes #6754

Authors:
  - brandon-b-miller
  - Keith Kraus

Approvers:
  - GALI PREM SAGAR
  - Keith Kraus

URL: #6775
@brandon-b-miller
Contributor

Hi @roe246, this should be available in the coming nightlies as cudf.DateOffset. Let us know if this works out for you.
