Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Integer overflow in df.sum() #8449

Closed
esnvidia opened this issue Jun 7, 2021 · 6 comments · Fixed by #9717
Closed

[BUG] Integer overflow in df.sum() #8449

esnvidia opened this issue Jun 7, 2021 · 6 comments · Fixed by #9717
Assignees
Labels
bug Something isn't working Python Affects Python cuDF API.

Comments

@esnvidia
Copy link

esnvidia commented Jun 7, 2021

Integer overflow occurs when calculating a sum over a dataframe column, but not in cupy

Steps/Code to reproduce bug

cupy.random.seed(314)
cupy.random.randint(2, size=10000, dtype=cupy.int8).sum()

Out: 5058

cupy.random.seed(314)
cudf.DataFrame(cupy.random.randint(2, size=10000, dtype=cupy.int8)).sum()
Out: 0 -62
dtype: int8

Expected behavior
Out: 5058

Environment overview (please complete the following information)
conda install of rapids-0.19

@esnvidia esnvidia added Needs Triage Need team to review and classify bug Something isn't working labels Jun 7, 2021
@shwina
Copy link
Contributor

shwina commented Jun 7, 2021

This is because, unlike CuPy, cuDF does not always assume int64 as the result type. Although maybe we should -- as this can lead to some hard-to-find bugs.

As a workaround, you could specify the result type:

In [32]: cupy.random.seed(314)
    ...: cudf.DataFrame(cupy.random.randint(2, size=10000, dtype=cupy.int8)).sum(dtype="int64")
Out[32]:
0    5058
dtype: int64

@shwina shwina added Python Affects Python cuDF API. and removed Needs Triage Need team to review and classify labels Jun 7, 2021
@brandon-b-miller
Copy link
Contributor

Pandas yields an int64 it seems to matter what the dtype of the input column is:

>>> x = pd.Series([0], dtype='uint8')
>>> x.sum()
0
>>> type(x.sum())
<class 'numpy.int64'>

However it seems to have no actual guards against overflow either

>>> x = pd.Series([np.iinfo('int64').max, np.iinfo('int64').max])
>>> x
0    9223372036854775807
1    9223372036854775807
dtype: int64
>>> x.sum()
-2

IMO what we do about this depends on if we'd have to internally make an upcasted copy of the data to do the reduction over, vs if libcudf's output type provides at workaround for this. heres the header. From this it looks like it's clever enough to just accumulate the result into an int64 or something else if requested but this should be tested in case my reading of it isn't right.

@shwina
Copy link
Contributor

shwina commented Jun 7, 2021

We can expect libcudf to do the right thing here. From Python, we should specify int64 as the result type, depending on the reduction.

@brandon-b-miller
Copy link
Contributor

That seems sensible to me, just wanted to make sure that wasn't translating to copy->cast->reduce, which it looks like it isn't.

@esnvidia
Copy link
Author

esnvidia commented Jun 8, 2021

Sounds good, it seems like it would also give issues with floats as well.

Related issue, but in cupy:

cupy.asarray(100*cupy.random.rand(10000), dtype=cupy.float16).sum()

Out: array(inf, dtype=float16)

cudf.DataFrame(cupy.asarray(100*cupy.random.rand(10000), dtype=cupy.float16)).sum()

Out: 0 494430.96875
dtype: float32

So sum of floats seem to cast to float32, Making the sequence larger I get inf again

cudf.DataFrame(cupy.asarray(100000*cupy.random.rand(100000000), dtype=cupy.float16)).sum()

0 inf
dtype: float32

Although this works when specifying sum dtype, which is very odd:

cudf.DataFrame(cupy.asarray(100000*cupy.random.rand(100000000), dtype=cupy.float32)).sum(dtype=cupy.float32)

0 4.999604e+12
dtype: float32

@brandon-b-miller brandon-b-miller self-assigned this Jun 10, 2021
@github-actions
Copy link

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

rapids-bot bot pushed a commit that referenced this issue Jan 11, 2022
Moving this casting logic to python and updating it so that integer sum and product operations give back an `int64` and give back the original column dtype in float cases. This is a breaking change.

Closes #8449

Authors:
  - https://github.com/brandon-b-miller

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Ashwin Srinath (https://github.com/shwina)

URL: #9717
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants