[BUG] Assigning scalar boolean to a Series w/ nulls results in wrong data type #9337

Closed
randerzander opened this issue Sep 29, 2021 · 3 comments · Fixed by #9803
Labels
bug Something isn't working Python Affects Python cuDF API.

Comments

@randerzander
Contributor

randerzander commented Sep 29, 2021

Using latest nightly cudf conda packages:

Assigning w/ booleans at DF creation time works as expected:

>>> df = cudf.DataFrame({'val': [True, None, False]})
>>> df.dtypes
val     bool
dtype: object

But assigning scalar after initializing w/ all nulls gives an unexpected float64 dtype where I'd expect a bool:

>>> df = cudf.DataFrame({'val': [None, None, None]})
>>> df.dtypes
val    float64
dtype: object

Pandas's behavior:

>>> pdf = pd.DataFrame({'val': [None, None, None]})
>>> pdf
    val
0  None
1  None
2  None
>>> pdf.dtypes
val    object
dtype: object
>>> pdf['val'] = False
>>> pdf
     val
0  False
1  False
2  False
>>> pdf.dtypes
val    bool
dtype: object
randerzander added the bug and Python labels Sep 29, 2021
@beckernick
Member

beckernick commented Sep 29, 2021

From offline discussion, the following example illustrates the discrepancy in coercing (or not coercing) from the original default dtype to bool.

import cudf
import pandas as pd

df = cudf.DataFrame({'val': [None, None, None]})
print(df.val.dtype)
df["val"] = True
print(df)
print(df.val.dtype, "\n")

df = pd.DataFrame({'val': [None, None, None]})
print(df.val.dtype)
df["val"] = True
print(df)
print(df.val.dtype)

# cuDF
float64
   val
0  1.0
1  1.0
2  1.0
float64 

# pandas
object
    val
0  True
1  True
2  True
bool

@shwina
Contributor

shwina commented Sep 30, 2021

As Nick mentioned, the crux of the problem is that we default to float as the type of an "all-nulls" column, where Pandas defaults to object:

In [8]: cudf.Series([None, None])
Out[8]:
0    <NA>
1    <NA>
dtype: float64

In [10]: pd.Series([None, None])
Out[10]:
0    None
1    None
dtype: object

We could default to object here too, and then this would work as expected, although note that object in cuDF is simply an alias for string.

In [13]: df = cudf.DataFrame({"val": cudf.Series([None, None], dtype="object")})

In [14]: df
Out[14]:
    val
0  <NA>
1  <NA>

In [15]: df["val"] = True

In [16]: df
Out[16]:
    val
0  True
1  True
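
To make the effect of the placeholder dtype concrete, here is a toy model in plain Python. It is only a sketch: `infer_dtype` and `assign_scalar` are hypothetical helpers, and the rule "coerce the scalar to an existing float64 column" is an assumption chosen to reproduce the outputs shown above, not cuDF's actual code path.

```python
def infer_dtype(values, all_null_default):
    """Toy dtype inference (hypothetical; not cuDF's or pandas' real logic)."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        # Nothing to infer from, so the library's default decides.
        return all_null_default
    if all(isinstance(v, bool) for v in non_null):
        return "bool"
    return "object"


def assign_scalar(values, scalar, all_null_default):
    """Broadcast a scalar over a column, returning (data, dtype).

    Assumption: a float64 column coerces the scalar to float, while an
    object placeholder lets the scalar's own type decide the result.
    """
    dtype = infer_dtype(values, all_null_default)
    if dtype == "float64":
        return [float(scalar)] * len(values), "float64"
    new_dtype = "bool" if isinstance(scalar, bool) else "object"
    return [scalar] * len(values), new_dtype


# Default of float64 for all-null columns: True is stored as 1.0.
print(assign_scalar([None, None, None], True, "float64"))
# Default of object for all-null columns: True stays a bool.
print(assign_scalar([None, None, None], True, "object"))
```

In this model the only thing that changes between the two calls is the default chosen for an all-null column, which matches the point made above.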

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

galipremsagar self-assigned this Nov 22, 2021
rapids-bot bot pushed a commit that referenced this issue Dec 14, 2021
…9803)

Fixes: #9337 

- [x] This PR changes the default `dtype` of an `all-nulls` column from `float64` to `object`.
- [x] For `np.nan` values to be read as a `float` column, `nan_as_null` has to be passed as `False` to the `cudf.DataFrame` constructor. This is in line with what the `cudf.Series` constructor already supports.
- [x] Added the `has_nans` and `nan_count` properties, which are needed for some of the checks.
- [x] Cached `nan_count`, since it is used repeatedly in math operations, and cleared the cache in the regular `_clear_cache` call.
- [x] Fixed the pytests that would otherwise break due to this breaking change of types.
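
The caching pattern described in the list above can be sketched in plain Python. This is only an illustration: `ToyColumn` and `set_values` are hypothetical names, and the use of `functools.cached_property` is an assumption, not necessarily how the PR implements it. Only the names `nan_count`, `has_nans` and `_clear_cache` come from the description.

```python
import math
from functools import cached_property


class ToyColumn:
    """Hypothetical stand-in for a column; not cuDF's real column class."""

    def __init__(self, values):
        self.values = list(values)

    @cached_property
    def nan_count(self):
        # Count float NaN entries; None (null) is deliberately not counted.
        return sum(
            1 for v in self.values if isinstance(v, float) and math.isnan(v)
        )

    @property
    def has_nans(self):
        # Cheap after the first call because nan_count is cached.
        return self.nan_count != 0

    def _clear_cache(self):
        # cached_property stores its result in the instance __dict__.
        self.__dict__.pop("nan_count", None)

    def set_values(self, values):
        # Any mutation must invalidate cached statistics.
        self.values = list(values)
        self._clear_cache()


col = ToyColumn([1.0, float("nan"), None])
print(col.nan_count, col.has_nans)
col.set_values([1.0, 2.0])
print(col.nan_count, col.has_nans)
```

The point of the sketch is the invalidation step: without the `_clear_cache` call in `set_values`, the second `nan_count` would return the stale value from the first computation.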

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - https://github.com/brandon-b-miller
  - Ashwin Srinath (https://github.com/shwina)

URL: #9803
4 participants