-
Notifications
You must be signed in to change notification settings - Fork 908
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEA] Add a new parameter for all_integer
& all_float
to evaluate null values as True
#5136
Comments
@galipremsagar I thought we decided that we would use |
I think calling After benchmarking a few different inputs, I found some interesting results. The current Seems like we are trying to not use
So, I went ahead and made changes to str to int conversion check import cudf._lib.strings.char_types as c
%timeit c.is_integer(data._column).all()
%timeit c.all_integers(data._column)
str to float conversion check import cudf._lib.strings.char_types as c
%timeit c.is_float(data._column).all()
%timeit c.all_floats(data._column)
Note: The numbers in Bold indicate best performing cells/APIs. It seems like @kkraus14 @davidwendt @jrhemstad Could you suggest if we can pick one of the APIs? |
Sample thrust::transform_reduce bool all_float(strings_column_view const& strings,
null_policy null_handling = null_policy::EXCLUDE,
rmm::mr::device_memory_resource* mr = rmm::mr::get_default_resource(),
cudaStream_t stream = 0)
{
auto strings_column = column_device_view::create(strings.parent(), stream);
auto d_column = *strings_column;
size_type strings_count = strings.size();
return thrust::transform_reduce(
rmm::exec_policy(stream)->on(stream),
thrust::make_counting_iterator<size_type>(0),
thrust::make_counting_iterator<size_type>(strings_count),
[d_column, null_handling] __device__(size_type idx) {
if (d_column.is_null(idx))
return null_handling == null_policy::INCLUDE;
return string::is_float(d_column.element<string_view>(idx));
}
,
true,
thrust::logical_and<bool>{}
);
}
thrust::count_if bool all_float(strings_column_view const& strings,
null_policy null_handling = null_policy::EXCLUDE,
rmm::mr::device_memory_resource* mr = rmm::mr::get_default_resource(),
cudaStream_t stream = 0)
{
auto strings_column = column_device_view::create(strings.parent(), stream);
auto d_column = *strings_column;
size_type strings_count = strings.size();
return thrust::count_if(
rmm::exec_policy(stream)->on(stream),
thrust::make_counting_iterator<size_type>(0),
thrust::make_counting_iterator<size_type>(strings_count),
[d_column, null_handling] __device__(size_type idx) {
if (d_column.is_null(idx))
return null_handling == null_policy::INCLUDE;
return string::is_float(d_column.element<string_view>(idx));
}) == strings_count;
} thrust::all_of bool all_float(strings_column_view const& strings,
null_policy null_handling = null_policy::EXCLUDE,
rmm::mr::device_memory_resource* mr = rmm::mr::get_default_resource(),
cudaStream_t stream = 0)
{
auto strings_column = column_device_view::create(strings.parent(), stream);
auto d_column = *strings_column;
auto transformer_itr =
thrust::make_transform_iterator(thrust::make_counting_iterator<size_type>(0),
[d_column, null_handling] __device__(size_type idx) {
if (d_column.is_null(idx))
return null_handling == null_policy::INCLUDE;
return string::is_float(d_column.element<string_view>(idx));
});
return thrust::all_of(rmm::exec_policy(stream)->on(stream),
transformer_itr,
transformer_itr + strings.size(),
thrust::identity<bool>());
} |
My opinion is we should optimize for the expected case of valid data. We should not optimize for expecting bad data since the data can usually not be processed any further (i.e. converting to integers or floats will fail). The My recommendation is to just use the |
For the sake of reducing API footprint I would suggest we continue with |
Thanks everyone for the inputs! Closing this issue as we'll be using |
Is your feature request related to a problem? Please describe.
This is a followup feature request from the discussion that happened in #5130
From the discussion happened here, it appears to be that
all_integer
/all_float
were added as a replacement foris_integer
/is_float
+all
operations.But since
is_integer
&is_float
returnnull
when there isnull
, and when we applyall
reduction the nulls get ignored, we'd like to have similar functionality whenall_integer
/all_float
is called.Currently
all_integer
&all_float
seem to returnFalse
fornull
.Describe the solution you'd like
One solution is to have a
null_policy
parameter added to the functions and evaluate based on the inclusion/ exclusion policy.Describe alternatives you've considered
The current alternative is to do
is_interger
/is_float
+all
.Additional context
PR that has cython bindings added: #5054
The text was updated successfully, but these errors were encountered: