[BUG] Integer overflow in df.sum() #8449

esnvidia · 2021-06-07T18:53:08Z

Integer overflow occurs when calculating a sum over a dataframe column, but not in cupy

Steps/Code to reproduce bug

cupy.random.seed(314)
cupy.random.randint(2, size=10000, dtype=cupy.int8).sum()

Out: 5058

cupy.random.seed(314)
cudf.DataFrame(cupy.random.randint(2, size=10000, dtype=cupy.int8)).sum()
Out: 0 -62
dtype: int8

Expected behavior
Out: 5058

Environment overview (please complete the following information)
conda install of rapids-0.19

The text was updated successfully, but these errors were encountered:

shwina · 2021-06-07T21:05:41Z

This is because, unlike CuPy, cuDF does not always assume int64 as the result type. Although maybe we should -- as this can lead to some hard-to-find bugs.

As a workaround, you could specify the result type:

In [32]: cupy.random.seed(314)
    ...: cudf.DataFrame(cupy.random.randint(2, size=10000, dtype=cupy.int8)).sum(dtype="int64")
Out[32]:
0    5058
dtype: int64

brandon-b-miller · 2021-06-07T21:19:34Z

Pandas yields an int64 it seems to matter what the dtype of the input column is:

>>> x = pd.Series([0], dtype='uint8')
>>> x.sum()
0
>>> type(x.sum())
<class 'numpy.int64'>

However it seems to have no actual guards against overflow either

>>> x = pd.Series([np.iinfo('int64').max, np.iinfo('int64').max])
>>> x
0    9223372036854775807
1    9223372036854775807
dtype: int64
>>> x.sum()
-2

IMO what we do about this depends on if we'd have to internally make an upcasted copy of the data to do the reduction over, vs if libcudf's output type provides at workaround for this. heres the header. From this it looks like it's clever enough to just accumulate the result into an int64 or something else if requested but this should be tested in case my reading of it isn't right.

shwina · 2021-06-07T21:24:07Z

We can expect libcudf to do the right thing here. From Python, we should specify int64 as the result type, depending on the reduction.

brandon-b-miller · 2021-06-07T21:30:09Z

That seems sensible to me, just wanted to make sure that wasn't translating to copy->cast->reduce, which it looks like it isn't.

esnvidia · 2021-06-08T13:56:58Z

Sounds good, it seems like it would also give issues with floats as well.

Related issue, but in cupy:

cupy.asarray(100*cupy.random.rand(10000), dtype=cupy.float16).sum()

Out: array(inf, dtype=float16)

cudf.DataFrame(cupy.asarray(100*cupy.random.rand(10000), dtype=cupy.float16)).sum()

Out: 0 494430.96875
dtype: float32

So sum of floats seem to cast to float32, Making the sequence larger I get inf again

cudf.DataFrame(cupy.asarray(100000*cupy.random.rand(100000000), dtype=cupy.float16)).sum()

0 inf
dtype: float32

Although this works when specifying sum dtype, which is very odd:

cudf.DataFrame(cupy.asarray(100000*cupy.random.rand(100000000), dtype=cupy.float32)).sum(dtype=cupy.float32)

0 4.999604e+12
dtype: float32

github-actions · 2021-11-15T21:03:47Z

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

Moving this casting logic to python and updating it so that integer sum and product operations give back an `int64` and give back the original column dtype in float cases. This is a breaking change. Closes #8449 Authors: - https://github.com/brandon-b-miller Approvers: - Vyas Ramasubramani (https://github.com/vyasr) - Ashwin Srinath (https://github.com/shwina) URL: #9717

esnvidia added Needs Triage Need team to review and classify bug Something isn't working labels Jun 7, 2021

shwina added Python Affects Python cuDF API. and removed Needs Triage Need team to review and classify labels Jun 7, 2021

brandon-b-miller self-assigned this Jun 10, 2021

github-actions bot added the inactive-90d label Nov 15, 2021

brandon-b-miller mentioned this issue Nov 17, 2021

Match pandas scalar result types in reductions #9717

Merged

rapids-bot bot closed this as completed in #9717 Jan 11, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] Integer overflow in df.sum() #8449

[BUG] Integer overflow in df.sum() #8449

esnvidia commented Jun 7, 2021 •

edited

Loading

shwina commented Jun 7, 2021

brandon-b-miller commented Jun 7, 2021

shwina commented Jun 7, 2021

brandon-b-miller commented Jun 7, 2021

esnvidia commented Jun 8, 2021

github-actions bot commented Nov 15, 2021

[BUG] Integer overflow in df.sum() #8449

[BUG] Integer overflow in df.sum() #8449

Comments

esnvidia commented Jun 7, 2021 • edited Loading

shwina commented Jun 7, 2021

brandon-b-miller commented Jun 7, 2021

shwina commented Jun 7, 2021

brandon-b-miller commented Jun 7, 2021

esnvidia commented Jun 8, 2021

github-actions bot commented Nov 15, 2021

esnvidia commented Jun 7, 2021 •

edited

Loading