Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Support for an additional parameter in is_integer, all_integer, is_float & all_float #5130

Closed
galipremsagar opened this issue May 7, 2020 · 10 comments
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python)

Comments

@galipremsagar
Copy link
Contributor

Is your feature request related to a problem? Please describe.
The current APIs(is_integer, all_integer, is_float & all_float) return False if there are null values in the input, But since we actually differ with pandas in this aspect and we support null values in int & float columns, we'd like to have the above APIs return True when there is a null in the input column.

Describe the solution you'd like
From the discussion we had here: #5054 (comment) It appears to be that we'd like to have an additional parameter to support this feature.

Describe alternatives you've considered
This is currently being handled on the python side with performing a dropna and then calling the above APIs.

Additional context
This request came up as part of #5054

@galipremsagar galipremsagar added feature request New feature or request Needs Triage Need team to review and classify labels May 7, 2020
@jrhemstad
Copy link
Contributor

Why is this desired behavior? null is not an integer, right?

@kkraus14 kkraus14 added libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) and removed Needs Triage Need team to review and classify labels May 7, 2020
@kkraus14
Copy link
Collaborator

kkraus14 commented May 7, 2020

@jrhemstad we're using these functions to specifically check if we can typecast safely. I.E. expected behavior from the Python side is to throw if a non-integer or non-float is detected because we can't safely typecast it. On the other hand, we can safely carry over nulls without issue so the value is irrelevant.

Alternative, we could return null in cases where the value is null, which we'd then need to do a follow up operation of replace_nulls with True or False against depending on the usage.

@jrhemstad
Copy link
Contributor

expected behavior from the Python side is to throw if a non-integer or non-float is detected because we can't safely typecast it.

I don't understand why this requires returning true for is_integer(null).

If you want to ensure all of the elements in a column are valid integers before typecasting, you run is_integer(...) which returns a nullable boolean column, and then you do an any/all reduction which ignores null elements.

What am I missing?

@kkraus14
Copy link
Collaborator

kkraus14 commented May 7, 2020

I believe that currently is_integer(...) returns False on nulls instead of null. If it returns null we're good to go.

@jrhemstad
Copy link
Contributor

I believe that currently is_integer(...) returns False

Ahhhh, now this makes sense. Okay.

@galipremsagar
Copy link
Contributor Author

galipremsagar commented May 8, 2020

So, in this PR: #5054, we have used all_integers & all_floats under an understanding of them being reductions of is_integer & is_float respectively. But this seems to not be the case.

is_integer & is_float seem to return null when there is null, and when we apply all reduction the nulls get ignored.

But all_integers & all_floats seem to return False for null.

>>> import cudf
>>> import cudf._lib.strings.char_types as c
>>> s = cudf.Series(["10", np.nan, "1"])
>>> cudf.Series(c.is_integer(s._column)).all()
True
>>> s = cudf.Series(["abc", None, "1"])
>>> cudf.Series(c.is_integer(s._column))
0    False
1     null
2     True
dtype: bool
>>> s = cudf.Series(["10", None, "1"])
>>> cudf.Series(c.is_integer(s._column)).all()
True
>>> c.all_integers(s._column)
False

>>> cudf.Series(c.is_float(s._column))
0    True
1    null
2    True
dtype: bool
>>> cudf.Series(c.is_float(s._column)).all()
True
>>> c.all_floats(s._column)
False

So, if this is expected behavior of all_floats & all_integers we'll fall back to using is_integer + all/ is_float + all.

@kkraus14
Copy link
Collaborator

kkraus14 commented May 8, 2020

Happy to be incorrect here then! Seems like we can just use is_float(column).all() and is_integer(column).all() and we can likely remove the all_floats and all_integers from libcudf 😄.

@galipremsagar
Copy link
Contributor Author

Happy to be incorrect here then! Seems like we can just use is_float(column).all() and is_integer(column).all() and we can likely remove the all_floats and all_integers from libcudf 😄.

Apologies for the confusion Keith & Jake.

Shall we close this issue then? or wait for comments from @jrhemstad ?

cc: @davidwendt

@kkraus14
Copy link
Collaborator

kkraus14 commented May 8, 2020

@galipremsagar I think if we're good to go with using is_integer and is_float along with a reduction we can close this issue out.

Maybe raise an issue that we should consider removing all_floats and all_integers as they're unneeded.

@galipremsagar
Copy link
Contributor Author

Sure 👍 Closing this FEA.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python)
Projects
None yet
Development

No branches or pull requests

3 participants